How to track down throughput problems on the Ziti router

Hello,

It looks like the private router isn't even trying to form a link, which I'm guessing means it doesn't know about the listener. Can you post the startup logs for the public router and/or check the output of ziti fabric list routers? The router list should show listeners for connected routers, something like:

$ ziti fabric list routers
╭────────────┬───────────────┬────────┬──────┬──────────────┬──────────┬───────────────────────┬───────────────────────╮
│ ID         │ NAME          │ ONLINE │ COST │ NO TRAVERSAL │ DISABLED │ VERSION               │ LISTENERS             │
├────────────┼───────────────┼────────┼──────┼──────────────┼──────────┼───────────────────────┼───────────────────────┤
│ 38esMCa98K │ edge-router-3 │ false  │    0 │ false        │ false    │                       │                       │
│ 5uUxuQ3u6Q │ edge-router-2 │ true   │    0 │ false        │ false    │ v0.0.0 on linux/amd64 │ 1: tls:127.0.0.1:4024 │
│ LvEVacBGn  │ edge-router-4 │ false  │    0 │ false        │ false    │                       │                       │
│ tNlYWQ3upQ │ edge-router-1 │ true   │    0 │ false        │ false    │ v0.0.0 on linux/amd64 │ 1: tls:127.0.0.1:4023 │
╰────────────┴───────────────┴────────┴──────┴──────────────┴──────────┴───────────────────────┴───────────────────────╯
results: 1-4 of 4

I would expect to see messages like:

[51824.513]    INFO ziti/router/link.(*linkRegistryImpl).evaluateLinkState: {key=[lo->tls:5uUxuQ3u6Q->lo]} queuing link to dial
[51824.513]    INFO ziti/router/link.(*linkRegistryImpl).evaluateLinkState.func1: {key=[lo->tls:5uUxuQ3u6Q->lo]} dialing link

in the private router's logs, if it knew about the link listener.

You can also run ziti fabric inspect links, which will give you information about the state of each router's links.

$ ziti fabric inspect links
Results: (2)
tNlYWQ3upQ.links
destinations:
- healthy: true
  id: 5uUxuQ3u6Q
  linkStates:
  - connectedCount: 2
    dialAttempts: 1
    dialerBinding: lo
    dialerGroups:
    - default
    id: 15vxjhmtMCacbXjhpq5crS
    key: lo->tls:5uUxuQ3u6Q->lo
    nextDial: "0001-01-01T00:00:00Z"
    retryDelay: 0s
    status: established
    targetAddress: tls:127.0.0.1:4024
    targetBinding: lo
    targetGroups:
    - default
  version: v0.0.0
errors: null
links:
- dest: 5uUxuQ3u6Q
  destVersion: v0.0.0
  dialAddress: tls:127.0.0.1:4024
  dialed: true
  id: 15vxjhmtMCacbXjhpq5crS
  key: lo->tls:5uUxuQ3u6Q->lo
  protocol: tls
  split: true

5uUxuQ3u6Q.links
destinations:
- healthy: true
  id: tNlYWQ3upQ
  linkStates:
  - connectedCount: 2
    dialAttempts: 1
    dialerBinding: lo
    dialerGroups:
    - default
    id: O77bzCjD5ZwXBwYZZLsIY
    key: lo->tls:tNlYWQ3upQ->lo
    nextDial: "0001-01-01T00:00:00Z"
    retryDelay: 0s
    status: established
    targetAddress: tls:127.0.0.1:4023
    targetBinding: lo
    targetGroups:
    - default
  version: v0.0.0
errors: null
links:
- dest: tNlYWQ3upQ
  destVersion: v0.0.0
  dialAddress: tls:127.0.0.1:4024
  dialed: false
  id: 15vxjhmtMCacbXjhpq5crS
  key: lo->tls:tNlYWQ3upQ->lo
  protocol: tls
  split: true

Even if a link hasn't been established yet, a router that is attempting to form the link will show it in the inspect output.

Cheers,
Paul

Hi @plorenz ,

Thanks for the troubleshooting tips. Running ziti fabric list routers made me realize my mistake: my public router was still running 0.27.5 from my tests with older versions. After upgrading it to 0.31.0, everything worked as expected. How embarrassing :see_no_evil: ...

Okay, this also means that the tests and logs collected in How to track down throughput problems on the Ziti router - #18 by marvkis were done with the 0.27.5 router version.

Some quick updates:

  • I just re-ran the timing tests through the router: I often see 6-7 second reloads (so it seems 0.31 is faster than 0.27), but also some with 12-second hiccups
  • On the 'internal' router I also see Recv-Qs filling up with traffic from the source, but on the public router everything seems fine: no Recv-Q backlog on the fabric link (see the ss command below for how I'm watching this)
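
In case it helps, something like this is an easy way to watch the queues (plain iproute2, nothing ziti-specific; 3022 stands in for the router's listener port):

# established TCP sockets with their Recv-Q/Send-Q, refreshed every second
watch -n1 "ss -tnp 'sport = :3022'"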

Should I get the logs from the 0.31 release, or do you think the 0.27 stacks should be enough to get an idea?

Thanks & Bye,
Chris

Hi @plorenz ,

Is there any news about this?
In the past, I've also seen the Recv-Q hang for minutes with no change in the numbers.

State  Recv-Q  Send-Q  Local Address:Port        Peer Address:Port         Process
ESTAB  0       0       [::ffff:10.42.0.68]:3022  [::ffff:10.42.0.77]:46660 users:(("ziti",pid=2634457,fd=28))
ESTAB  0       0       [::ffff:10.42.0.68]:3022  [::ffff:10.42.0.77]:48024 users:(("ziti",pid=2634457,fd=10))
ESTAB  5841841 0       [::ffff:10.42.0.68]:3022  [::ffff:10.42.0.77]:54986 users:(("ziti",pid=2634457,fd=49))
ESTAB  0       0       [::ffff:10.42.0.68]:3022  [::ffff:10.42.0.77]:45588 users:(("ziti",pid=2634457,fd=12))
ESTAB  0       0       [::ffff:10.42.0.68]:3022  [::ffff:10.42.0.77]:48108 users:(("ziti",pid=2634457,fd=23))

And people started complaining that some services were not available. I think I've managed to get a stack dump of this situation using ziti fabric inspect stackdump. I'm now on 0.31.2, and it seems that ziti agent list is not showing the routers... ?

Thanks & bye,
chris

Hi @marvkis ,

I've wrapped up the other things I was working on, so today I'm digging into this. I'm going to look into the captures you sent first, then work on setting up a lab environment to try and reproduce this. Your lab setup sounded pretty simple:

First test was:

  1. controller and public router on one node
  2. ziti-edge-tunnel and some host application on the same node as ctrl/router
  3. Client running the macOS tunneler

Follow-up testing was similar, but with an edge-router/tunneler in place of the ziti-edge-tunnel?

Was the client running on the same network as everything else, or was it running elsewhere? If elsewhere, how far? us-east to us-west, or closer, or farther?

Looking at the applications you listed, does an HTTP application sending lots of requests of various sizes as fast as possible sound like it would trigger the stall? I was thinking of using something like bombardier.
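
For concreteness, I'd probably drive it with something like this (bombardier's standard flags; the URL is a placeholder for whatever test service we set up):

# 50 concurrent connections for 60 seconds, printing the latency distribution
bombardier -c 50 -d 60s -l http://test-service.example:8080/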

If you send me the stackdumps, I'll take a look. How does the 'service not available' manifest? Are there errors in the client tunnelers?

ziti agent list isn't showing routers running on the local box? Or is it not showing remote routers? ziti agent only works on local processes. It communicates with the processes over unix sockets, so you have to run it as a user that can access the unix socket file. We often run into the issue where someone runs ziti router as root when using the tunneler, then runs ziti agent without sudo and can't see the router.
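
For example, if the router process is running as root, the agent has to run as root as well:

# run as the same user as the router process so the agent can reach its unix socket
sudo ziti agent list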

I'll start by digging through what you've already sent, but if you could let me know how the proposed test setup sounds, that would be appreciated!

Thank you,
Paul

Hi @plorenz ,

Sounds great!

Yes, this describes my test setup quite well. For the edge-router/tunneler tests, I first used the existing public router in combination with a service reachable in the 'public DMZ'. I deployed everything in a Kubernetes cluster, but that shouldn't make a difference.

The lab is in our DC, and my macOS client is at my 'home office'. The latency between router and client is ~15-20 ms; bandwidth is 250 Mbit down / 60 Mbit up. I'm in Germany, and everything is located within Europe :wink:

I also see it for single streams transferring large files, like an S3 transfer/download. I've also seen browsers - when HTTP/2 is working - multiplex many requests over the same TCP session. In my tests with Nextcloud, I've seen only one TCP session for most tests. I think the key to triggering the problem is having a server that can serve something like a 50 or 100 MB file very quickly.
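
If it helps for your lab setup, a minimal reproducer along those lines, using only standard tooling (nothing ziti-specific):

# create a 100 MB file of random data and serve it over plain HTTP
dd if=/dev/urandom of=test.bin bs=1M count=100
python3 -m http.server 8080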

The router is deployed in a Kubernetes cluster using the helm chart from here: helm-charts/charts/ziti-router at main · openziti/helm-charts · GitHub. Maybe the router just isn't allowed to create the socket; I will double-check this.
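
A quick way to check that from inside the cluster (the resource name is a placeholder for my deployment):

# run the agent inside the router pod, as the same user as the router process
kubectl exec -it deploy/ziti-router -- ziti agent list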

If you have trouble reproducing it, just contact me.

Thanks,
Chris

Some good news. Digging back into the initial traces you sent, I saw one clear anomaly that led me to a fix for a stall. What I saw was this in a circuit inspect:

    sendBufferDetail:
      ...skipped...
      blockedByLocalWindow: true
      blockedByRemoteWindow: false
      ...skipped...
      linkRecvBufferSize: 0
      linkSendBufferSize: 0

It was indicating that the circuit wasn't reading any more data in (blockedByLocalWindow: true), but linkSendBufferSize was 0, so that shouldn't have been the case. In the code, I found that we were checking the window size against the send buffer size before doing some ack processing. So we could decide not to read more data in, then have acks clear the send buffer, and never recheck the window before waiting for the next event to trigger processing.

I happened to be doing a release today, so the fix for this is in OpenZiti v0.31.3.

I checked the circuit dumps from your test using the edge-router/tunneler, and unfortunately I didn't see evidence of the same issue. So it's likely that there's another problem. What I see in those circuits is that the window size seems to stay at the minimum for a long time. I'm wondering if we need to raise the initial window size. I need to do some more digging, thinking, and probably some experimenting here. I'll likely send you a follow-up with some information on how you can try different flow control parameters yourself and see what makes a difference.
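
In case you want to experiment before then: the knobs I have in mind are the xgress flow-control options in the router config. A rough sketch of where they live; treat both the placement and the values as illustrative, not recommendations:

# router config fragment (values illustrative only)
listeners:
  - binding: edge
    address: tls:0.0.0.0:3022
    options:
      txPortalStartSize: 16384      # initial flow-control window size
      txPortalIncreaseThresh: 224   # controls how quickly the window grows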

Cheers,
Paul

It looks like what's happening for bigger/faster transfers is just a very slow ramp in window size. In the data you sent, the window size stays at the minimum for about 12 seconds. There's definitely something odd going on there; especially with a single router, the feedback loop should be much faster. I'll probably be experimenting with this for a few days, and I'll let you know what I find.

Cheers,
Paul

Thanks to your second set of circuit captures, I was able to track down the second issue: If ack is received before payload is processed by link send buffer, a stall can result · Issue #1586 · openziti/ziti · GitHub

The fix has been merged and will be in the next release, which I'm going to push out today. I really appreciate the extra effort you put into documenting the issue and capturing some really high-quality data showing it in action. It made these issues much easier to track down.

Once you're able to test with v0.31.4, let us know if the issues are resolved or if there are some things we still need to chase down.

Thanks again,
Paul

Hi @plorenz

Wow - I was out of the office all day and just discovered your posts. This sounds really cool. I'm going to give the new version a try right now.

Cheers,
Chris

Oh. It seems that 0.31.4 hasn't arrived on Docker Hub yet, but the release-next image is only an hour old. Is 0.31.4 in release-next?

@plorenz after upgrading from 0.31.2 to release-next (ziti -v reports 0.31.4) all services fail on posture checks:
[screenshot from 2023-12-14 showing services failing posture checks]
... I have no posture checks configured.
Will cross-check with 0.31.3

Yes, release-next is where the next release is staged. So once the release PR is approved, release-next will get merged to main, which will trigger a release.

I have a suspicion about what that could be; let me check. The smoke tests do run a variety of data flow tests, though, and I don't think we have posture checks in the smoke tests.

0.31.3 also suffers from this issue, so it seems to have been introduced in that version. 0.31.2 is fine.

Hi @plorenz

I currently have a setup with the controller on 0.31.2 and the router on 0.31.4 (release-next), and have repeated these tests:

And now - fasten your seatbelts (not sure if this is also an English saying):

The new times:
Through the router: 2.54s / 3.71s / 2.22s / 2.83s / 2.76s
Through the tunneler: 5.02s / 5.22s / 3.03s / 3.67s / 2.47s / 2.92s
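
These are page reload times; a roughly equivalent curl one-liner, for anyone who wants to reproduce without a browser (the URL is a placeholder):

# print the total transfer time for a single request
curl -o /dev/null -s -w '%{time_total}\n' https://my-service.example/testfile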

Damn Paul, this is really fast now :tada: ! Great job :medal_sports:!

Thank you so much for fixing this!

:ziggy_rocks_closeup:

That's great to hear! Thank you for all your help. And yes, 'fasten your seatbelts' is common in English too :slight_smile:

I think I know what's going on with the posture check situation. The smoke test doesn't include desktop tunnelers, and I suspect a posture check/service lookup optimization broke them. I should have a fix soon, hopefully tomorrow, included in the 0.31.4 release.

Cheers,
Paul

The v0.31.4 release is now out: Release Notes. It took a little longer than anticipated; I was doing some extra testing.

1 Like