It looks like the private router isn't even trying to form a link, which I'm guessing means it doesn't know about the listener. Can you post the startup logs for the public router and/or check the output of ziti fabric list routers? The router list should show listeners for connected routers. If the private router knew about the link listener, you would see something like this in its logs:
[51824.513] INFO ziti/router/link.(*linkRegistryImpl).evaluateLinkState: {key=[lo->tls:5uUxuQ3u6Q->lo]} queuing link to dial
[51824.513] INFO ziti/router/link.(*linkRegistryImpl).evaluateLinkState.func1: {key=[lo->tls:5uUxuQ3u6Q->lo]} dialing link
You can also run ziti fabric inspect links, which will give you some information about the state of each router's links.
Thanks for the troubleshooting tips. Running ziti fabric list routers made me realize my mistake: my public router was still running 0.27.5 from my tests with older versions. After upgrading it to 0.31.0, everything worked as expected. How embarrassing ...
I just re-ran the timing tests through the router: I often see 6-7 second reloads (so it seems 0.31 is faster than 0.27), but also some with 12-second hiccups.
On the 'internal' router I also see Recv-Qs being filled by traffic from the source, but on the public router everything seems fine. No Recv-Q data on the fabric link.
Should I get the logs from the 0.31 release, or do you think the 0.27 stack dumps should be enough to get an idea?
Also, people started complaining that some services were not available. I think I've managed to get a stack dump of this situation using ziti fabric inspect stackdump. I'm now on 0.31.2, and it seems that ziti agent list is not showing the routers ... ?
I've wrapped up the other things I was working on, so today I'm digging into this. I'm going to look into the captures you sent first, then work on setting up a lab environment to try and reproduce this. Your lab setup sounded pretty simple:
First test was:
controller and public router on one node
ziti-edge-tunnel and some host application on the same node as ctrl/router
Client running MacOS tunneler
Follow-up testing was similar but with an edge-router/tunneler in place of the ziti-edge-tunnel?
Was the client running on the same network as everything else, or was it running elsewhere? If elsewhere, how far? us-east to us-west or closer or farther apart?
Looking at the applications you listed, does an HTTP application sending lots of requests of various sizes as fast as possible sound like it would trigger the stall? I was thinking of using something like Bombardier.
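Roughly the kind of load pattern I have in mind, as a quick Go sketch - the endpoint URL, the bytes query parameter, and the size mix are all placeholder assumptions, not anything from your setup:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	// Placeholder mix of response sizes: 1 KB up to 10 MB.
	sizes := []int{1 << 10, 64 << 10, 1 << 20, 10 << 20}
	const workers = 32

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				// Hypothetical endpoint that returns a body of the requested size.
				size := sizes[(id+i)%len(sizes)]
				resp, err := http.Get(fmt.Sprintf("http://target.example/blob?bytes=%d", size))
				if err != nil {
					fmt.Println("request error:", err)
					return
				}
				io.Copy(io.Discard, resp.Body) // drain fully so connections get reused
				resp.Body.Close()
			}
		}(w)
	}
	wg.Wait()
}
```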
If you send me the stackdumps, I'll take a look. How does the 'service not available' manifest? Are there errors in the client tunnelers?
ziti agent list isn't showing routers running on the local box? Or is it not showing remote routers? ziti agent only works on local processes. It communicates with the processes over unix sockets, so you have to run it as a user that can access the unix socket file. We often run into the issue where someone is running ziti router as root when using the tunneler, then runs ziti agent without sudo and can't see the router.
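If it helps to sanity-check the permissions angle, here's a small Go sketch; the socket path is hypothetical, so substitute whatever the agent actually creates on your box:

```go
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	// Hypothetical path - the point is just that the user running the
	// check needs access to whatever socket file the agent created.
	const sock = "/tmp/example-agent.sock"

	info, err := os.Stat(sock)
	if err != nil {
		fmt.Println("cannot stat socket:", err)
		return
	}
	fmt.Printf("socket mode: %v\n", info.Mode())

	conn, err := net.Dial("unix", sock)
	if err != nil {
		fmt.Println("dial failed (often a user/permissions mismatch):", err)
		return
	}
	conn.Close()
	fmt.Println("dial ok - this user can reach the agent")
}
```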
I'll start with digging through what you've already sent, but if you could let me know how the proposed test setup sounds, that would be appreciated!
Yes, this describes my test setup quite well. For the edge-router/tunneler tests, I first used the existing public router in combination with a service reachable in the 'public dmz'. I deployed everything in a Kubernetes cluster, but that shouldn't make a difference.
The lab is in our DC, and my macOS client is at my 'home office'. The latency router<>client is ~15-20 ms, bandwidth 250 Mbit down / 60 Mbit up. I'm from Germany, and everything is located within Europe.
I also see it for single streams transferring large files, like an S3 transfer/download. I've also seen browsers - when HTTP/2 is working - multiplex many requests over the same TCP session. In my tests with Nextcloud, I've seen only one TCP session for most tests. I think the key is having a server that can serve something like a 50 or 100 MB file very quickly to trigger the problem.
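For reproducing that, something as simple as this should do - a minimal Go sketch of such a server, where the 100 MB size and the port are arbitrary choices:

```go
package main

import (
	"crypto/rand"
	"log"
	"net/http"
)

func main() {
	// 100 MB of random bytes, generated once at startup (size is arbitrary).
	blob := make([]byte, 100<<20)
	if _, err := rand.Read(blob); err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/blob", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/octet-stream")
		w.Write(blob) // one large response to saturate a single stream
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```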
Some good news. Digging back into the initial traces you had sent, I saw one clear anomaly in a circuit inspect, which led me to a fix for a stall.
The circuit was indicating that it shouldn't read any more data in (blockedByLocalWindow: true), but the linkSendBufferSize was 0, so that shouldn't have been the case. In the code, I found that we were checking the window size against the send buffer size before we did some ack processing. So we could decide that we shouldn't read more data in, then acks would clear the send buffer, and we wouldn't recheck the window before waiting for the next event to trigger processing.
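In sketch form, the broken ordering looked roughly like this (hypothetical names, not the actual source):

```go
package main

// Just the shape of the bug: the "blocked" decision is made before acks
// are processed, and never revisited afterwards.
type ack struct{ payloadSize int }

type sendBuffer struct {
	size       int   // bytes currently buffered, waiting for acks
	windowSize int   // local window limit
	pending    []ack // acks waiting to be processed
}

func (b *sendBuffer) readMore() { /* pull more data from the source */ }

func (b *sendBuffer) process() {
	// Buggy order: decide whether we're blocked *before* handling acks.
	blockedByLocalWindow := b.size >= b.windowSize

	for _, a := range b.pending {
		b.size -= a.payloadSize // acks shrink the send buffer...
	}
	b.pending = nil
	// ...but blockedByLocalWindow is never recomputed, so even though the
	// buffer may now be empty, nothing is read until the next event fires.

	if !blockedByLocalWindow {
		b.readMore()
	}
	// The fix: re-evaluate b.size >= b.windowSize after ack processing.
}

func main() {}
```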
I happened to be doing a release today, so the fix for this is in OpenZiti v0.31.3.
I checked the circuit dumps from your test using the edge-router/tunneler and I didn't see evidence of the same issue, unfortunately. So it's likely that there's another issue. What I see in those circuits is that the window size seems to stay at the minimum for a long time. I'm wondering if we need to raise the initial window size. I need to do some more digging, thinking, and probably some experimenting here. I'll likely send you a follow-up with some information on how you can try some different flow control parameters yourself and see what makes a difference.
It looks like what's happening for bigger/faster transfers is just a very slow ramp in window size. In the data you sent, the window size stays at minimum for about 12 seconds. There's definitely something odd going on there. Especially with a single router, the feedback loop should be much faster. I'll probably be experimenting with this for a few days, and I'll let you know what I find.
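For a sense of scale: even a conservative additive ramp, growing once per round trip, should finish in a couple of seconds rather than sitting at minimum for 12. A back-of-the-envelope sketch, where every number is an assumption for illustration rather than an actual flow-control parameter:

```go
package main

import "fmt"

func main() {
	// Assumed numbers: how long an additive-increase window takes to ramp
	// from a minimum to a target, with one increase per round trip.
	const (
		rttMs     = 20.0     // observed router<>client latency
		minWindow = 16 << 10 // assumed 16 KB starting window
		target    = 4 << 20  // assumed 4 MB window for a fast transfer
		increment = 32 << 10 // assumed growth of 32 KB per round trip
	)
	steps := float64(target-minWindow) / float64(increment)
	fmt.Printf("~%.0f round trips, ~%.1f seconds to ramp up\n", steps, steps*rttMs/1000.0)
	// Prints roughly 128 round trips, ~2.5 seconds - well under 12.
}
```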
The fix has been merged up and will be in the next release, which I'm going to push out today. I really appreciate the extra effort you put into documenting the issue and capturing some really high quality data showing it in action. That made it much easier to track down.
Once you're able to test with v0.31.4, let us know if the issues are resolved or if there are some things we still need to chase down.
Yes, release-next is where the next release is staged. So once the release PR is approved, release-next will get merged to main, which will trigger a release.
I have a suspicion about what that could be; let me check. Although, the smoke tests do run a variety of data flow tests, and I don't think we have posture checks in the smoke tests.
That's great to hear! Thank you for all your help. Also, 'fasten your seatbelts' is common in English, too.
I think I know what's going on with the posture check situation. The smoke test doesn't include desktop tunnelers, and I suspect a posture check/service lookup optimization broke them. I should hopefully have a fix soon, ideally tomorrow, included in the 0.31.4 release.