OpenZiti bulk data throughput?

I'm evaluating OpenZiti for a project I'm working on where I need to send a large amount of data through the network. I've read some previous discourse threads on OpenZiti performance, but they are either from 2023 or older, or are more about fast connection establishment than bulk throughput. I also figure some changes have happened since those threads were posted.

To try to evaluate the throughput, I've set up a Ziti network consisting of two routers in tproxy mode and a controller. The control plane runs on a slow network, while the data path goes over a private network. The controller and routers each run in an Ubuntu 24.04.1 LTS VM with 8 CPUs and 4 GB of RAM on ESXi. The private network for the data is a private vswitch with no uplink, and all the VMs run on the same host, so latency is low and throughput is high on the private network.

If I run iperf using TCP and 128 KB writes through the private network between the router nodes (with the iperf server/client running on the router hosts themselves), I get something like ~25 Gbit/s on the "bare" private network with no tunneling.
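For reference, the baseline runs were along these lines (iperf3 syntax, addresses are placeholders):

    # on the rx router host
    iperf3 -s

    # on the tx router host: TCP with 128 KB reads/writes over the private network
    # (the later multi-stream runs just added -P 3)
    iperf3 -c <rx-private-ip> -l 128K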

This is obviously not representative of something that has to do encryption and tunneling, so I set up stunnel between the two router nodes to see what tunneling data over TLS would look like on this same network. Sending the iperf traffic through the stunnel, I get a little under 5 Gbit/s.
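The stunnel comparison was roughly this shape (service names, ports, and paths are placeholders):

    ; on the rx router host: terminate TLS, hand off to the local iperf server
    [iperf-in]
    accept  = 4433
    connect = 127.0.0.1:5201
    cert    = /etc/stunnel/stunnel.pem

    ; on the tx router host: accept plaintext from iperf, wrap it in TLS
    [iperf-out]
    client  = yes
    accept  = 127.0.0.1:5201
    connect = <rx-private-ip>:4433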

At this point, I figure that if I send traffic through the same network between two OpenZiti routers, I should see performance somewhere in this ballpark, since the routers in tproxy mode should ultimately be doing something similar to stunnel: proxying my data over a TLS link between each other.

However, when I set up an OpenZiti service, I get only about 700 Mbit/s. Asking iperf to use 3 connections gets me up to ~1 Gbit/s combined throughput, and any additional connections don't raise the combined throughput; they just get a diminishing share of the ~1 Gbit/s as more connections are added.

So I took a look at some of the discourse threads and tried applying the suggested sysctl settings to raise the system socket buffer sizes to 16 MB. This didn't seem to do anything. I saw that perhaps "realistic latency" might be important, so I used tc qdisc on the nodes to add 10 ms of latency in each direction between the routers (for an RTT of 20 ms); this did nothing but lower the overall performance even further. To be fair, it also lowered the stunnel and "bare" network numbers, though I think those can be recovered by asking iperf to raise its socket buffer size.
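The tuning I tried was along these lines (interface name is a placeholder):

    # raise socket buffer limits to 16 MB
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

    # add 10 ms of one-way delay on the private NIC of each router (20 ms RTT)
    tc qdisc add dev <private-nic> root netem delay 10ms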

I also saw that there are buffer adjustments I can configure on the routers in OpenZiti, like the txPortal* and rxBufferSize settings. I raised these to basically the sysctl max (16 MB) and increased txBuffers to 128. I added these to all the "options:" sections in the ziti-router config YAML and restarted the routers. Any adjustments I made with these settings just seemed to make performance worse, if anything.
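Roughly where I put them, shown for the tunnel listener (I made the equivalent change in each options: block):

    listeners:
      - binding: tunnel
        options:
          mode: tproxy
          rxBufferSize: 16777216       # ~ the 16 MB sysctl max
          txPortalStartSize: 16777216  # one of the txPortal* values I raised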

Looking at CPU, with one connection through OpenZiti the ziti-router process uses a little over 200% CPU according to top. If I push the number of connections up to reach 1 Gbit/s, it uses ~350% CPU on the tx side and ~500% CPU on the rx side. There isn't much memory usage, so nothing appears to be starved for CPU or RAM, as far as I can tell.

Digging further, I turned on debug logging and when I ran iperf I noticed a lot of this sort of logging on the router doing the tx:

Nov 20 18:23:34 ubuntuguest ziti[14825]: {"_context":"{c/wU281QBQ6|@/ZVEK}\u003cInitiator\u003e","circuitId":"wU281QBQ6","file":"github.com/openziti/ziti/router/xgress/xgress.go:799","func":"github.com/openziti/ziti/router/xgress.(*Xgress).sendUnchunkedBuffer","level":"debug","msg":"forwarded [10.3 kB]","origin":0,"seq":1829,"time":"2024-11-20T18:23:34.083Z"}

And on the rx router side I'm seeing:
Nov 20 18:23:36 ubuntuguest ziti[14356]: {"_context":"{c/wU281QBQ6|@/ZaN4}\u003cTerminator\u003e","circuitId":"wU281QBQ6","file":"github.com/openziti/ziti/router/xgress/xgress.go:453","func":"github.com/openziti/ziti/router/xgress.(*Xgress).tx.func2","level":"debug","msg":"payload 2669 of size 10257 removed from rx buffer, new size: 0","origin":0,"seq":2669,"time":"2024-11-20T18:23:36.512Z"}

Is the router really ingesting the traffic 10 KB at a time and sending ~10 KB payloads over the overlay? If so, I could see the tremendous syscall overhead of doing that being a problem. I would expect it to converge on a buffer size closer to the 128 KB that iperf is writing, at the very least?

I'm no Golang expert, but looking at the source I see the default "Mtu" setting is "0", which seems to make the router read as large a buffer as it can from the underlying system, right? So is this a Go I/O library problem where it only hands the router code 10 KB at a time from this continuous incoming stream, or is the router code only asking for such a tiny buffer on its own somehow?
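To sanity-check my understanding of the Go side, here's a trivial standalone sketch (not the actual xgress code, just the io.Reader contract): conn.Read(buf) returns at most len(buf) bytes per call, so a fixed ~10 KB buffer caps every forwarded payload at ~10 KB no matter how much data is queued behind it:

    package main

    import (
        "fmt"
        "net"
    )

    // ingest mimics a proxy read loop with a fixed-size buffer: each Read
    // returns at most len(buf) bytes, so every forwarded chunk is capped at
    // bufSize regardless of how large the client's writes are.
    func ingest(conn net.Conn, bufSize int, forward func([]byte)) {
        buf := make([]byte, bufSize)
        for {
            n, err := conn.Read(buf)
            if err != nil {
                return
            }
            forward(buf[:n])
        }
    }

    func main() {
        client, server := net.Pipe()
        go func() {
            client.Write(make([]byte, 128*1024)) // one 128 KB write, like iperf -l 128K
            client.Close()
        }()
        // prints 13 chunks, none larger than 10240 bytes
        ingest(server, 10*1024, func(p []byte) {
            fmt.Printf("forwarded [%d B]\n", len(p))
        })
    }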

Digging deeper with strace on the Tx side I also see this sort of thing going on:
1227 1732131703.617182 read(13, <unfinished ...>
1226 1732131703.617214 nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
1227 1732131703.617241 <... read resumed>""..., 3456) = 1257
1227 1732131703.617301 futex(0x40000a7648, FUTEX_WAKE_PRIVATE, 1) = 1
1229 1732131703.617355 <... futex resumed>) = 0
1226 1732131703.617363 <... nanosleep resumed>NULL) = 0
1229 1732131703.617387 epoll_pwait(4, <unfinished ...>
1226 1732131703.617394 nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
1229 1732131703.617408 <... epoll_pwait resumed>, 128, 0, NULL, 0) = 0
1227 1732131703.617416 read(15, <unfinished ...>
1229 1732131703.617454 futex(0x40004a8b48, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
1227 1732131703.617464 <... read resumed>""..., 10240) = 10240
1229 1732131703.617486 <... futex resumed>) = 1
1246 1732131703.617539 <... futex resumed>) = 0
1227 1732131703.617558 write(12, ""..., 10342 <unfinished ...>

Which looks like a read returning ~1.2 KB, a 20 µs sleep, then a read on another fd filling its 10 KB buffer after a further (locking-related?) delay, and then a write of a bit over 10 KB. I think the 20 µs sleep may be a golang polling library thing, but with such small reads this means there's 2x syscall overhead plus multiple 20 µs poll overheads for every logical read from the iperf client. If larger reads were happening, the 20 µs polling might actually be worth it, but with such small reads it's just more overhead?

After looking at all of this I also stumbled on this:

Which seems to be saying that the best I can expect out of an 8-CPU setup is something like 500 Mbit/s combined throughput? Is this currently true? Am I barking up the wrong tree trying to get more than 500 Mbit/s of throughput out of OpenZiti, or did I just not configure larger IO settings properly?

Any help here would be appreciated!

Thanks!

Shooting from the hip, you are not barking up the wrong tree. The sizing guide is deliberately conservative. I know of a couple of deployments/tests with some performance benchmarks:

  • One organisation is using OpenZiti for backup, moving data from an on-prem NetApp to AWS at 6.5 gigabits/sec throughput, using 4 Edge Routers sized as C5n.xlarge (4 vCPUs, 10.5 GiB of memory, and up to 25 Gbps of bandwidth) - i.e., 1.625 gigabits/sec per edge router. This is a real-world connection that has some latency and loss.
  • I know one of our engineers set up a connection test (i.e., no latency or loss) with 2 vCPUs and 4.0 GiB of memory and achieved a throughput of 145 MB/s (or 1.16 Gb/s).

I will, however, leave it to the people far more qualified than me to suggest how best to set up for larger throughputs.

I guess what I'm wondering is whether there is some config tuning that I'm missing that would get me closer to 3-5 Gbit/s through a single router?

Is this a tunables thing, or is ~1-2 Gbit/s the effective max per router without OpenZiti (or Go library) code changes?

We've been looking into this for some time, making improvements along the way. Some of the information you've posted (which is fantastic, and we thank you for it) is best evaluated by someone who is OOO right now, so there may be some delay in a response at that level.

The payloads are ziti messages, which are broadly analogous to packets or frames; they are the unit of data that figures into the flow control mechanisms. They are limited in size to balance the streams between being able to recover from any losses and overall performance, similar to TCP. Flow control has recently been a major point of investigation: we found some misbehavior where messages were being extracted from the buffers out of order, and changes were made relatively recently to resolve that. The out-of-order behavior caused thrashing and inefficiency in the flow control process, limiting throughput.

There are a few other avenues of investigation ongoing as well, looking at the specific functions in different architectures: tunnelers or not, low latency, higher latency, an intermediate fabric router, etc. There are a lot of variables, and we don't want to fix one at the expense of the others.

All that said, we are certainly working hard to improve the overall throughput of the system, and would appreciate any information you can add from your own testing. The dev team will take a look at the details of what you've given us above once they are available; with the holiday coming up here in the US it may be a bit delayed, but we wanted to let you know we really appreciate the feedback and will be giving it some attention.

Hi @mr_z

That 10k buffer issue you've pinpointed is something that probably needs to be made into a tunable. I did write code at one point to try and make it auto-tuning. The change adjusted the buffer size on the fly based on whether reads filled the buffer: if they did, it would increase the buffer size; if they didn't, it would shrink it, with a little bit of slack so that it would hopefully stabilize at a size slightly larger than what reads were returning.
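As a rough sketch of that grow/shrink idea (illustrative only, with made-up names, slack, and bounds; the actual change is linked below):

    package main

    import "fmt"

    // nextBufferSize is a toy version of the auto-tuning idea: grow the read
    // buffer when reads fill it, shrink it (with some slack) when they don't,
    // so it settles slightly above the typical read size.
    // Requires Go 1.21+ for the built-in min/max.
    func nextBufferSize(current, lastRead, lo, hi int) int {
        const slack = 1024
        switch {
        case lastRead == current:
            return min(current*2, hi) // reads are filling the buffer: grow
        case lastRead+slack < current:
            return max(lastRead+slack, lo) // buffer mostly empty: shrink
        default:
            return current
        }
    }

    func main() {
        size := 10 * 1024
        for _, pending := range []int{128 * 1024, 128 * 1024, 128 * 1024, 2 * 1024, 2 * 1024} {
            readLen := min(pending, size) // a read never returns more than the buffer size
            size = nextBufferSize(size, readLen, 4*1024, 64*1024)
            fmt.Printf("read %6d B -> next buffer %6d B\n", readLen, size)
        }
    }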

See change here: Ensure buffers are big enough for UDP datagrams. Fixes #1329 · openziti/ziti@e68f655 · GitHub

The test results are here.

In most cases the adaptive buffer sizing made perf worse, but in one outlier case it almost doubled it. The takeaway was that even if we can manage larger buffer sizes at ingest, those larger payloads run into fragmentation issues on longer/lossier routes and slow things down.

If you're interested in testing alternate buffer sizes in your environment, I can probably add those easily.

Other than that, we're due for another round of performance testing/tuning, but it'll likely come after the HA work is wrapped up. I did recently do some perf work around DTLS links and implemented a lower-overhead link-level protocol. That's currently only enabled for DTLS links, since there are cross-router version compatibility concerns, but it's another avenue for some extra throughput. That's likely only in the 10-20% range, though.

There was also another issue which was causing excessive re-transmits because of missing back-pressure. That was found and fixed during the DTLS testing. I'm curious if the adaptive buffering would work better now that that's fixed.

Sorry, bit of a scattered response there. Let me know if you're interested in testing the adaptive buffering or getting a tunable for buffer sizes.

Other than that, last time I was doing perf testing, I was also testing with some alternate flow control settings. Here are the ones I was using:

      txPortalStartSize: 4192000
      txPortalIncreaseThresh: 250
      txPortalIncreaseScale: 0.5
      txPortalRetxThresh: 50
      txPortalRetxScale: 0.9
      retxScale: 1.05
      txPortalIncreaseThresh: 256
      txPortalDupAckThresh: 5

Paul
