I'm evaluating OpenZiti for a project where I need to push a large amount of data through the network. I've read some previous discourse threads on OpenZiti performance, but they're either from 2023 or older, or are more about fast connection establishment than bulk throughput. I also figure some things have changed since those threads were posted.
To evaluate throughput, I've set up a Ziti network consisting of two routers in tproxy mode and a controller. The control plane runs on a slow network and the data path goes over a private network. The controller and each router run in an Ubuntu 24.04.1 LTS VM with 8 CPUs and 4 GB of RAM on ESXi. The private network for the data is a private vSwitch with no uplink and all the VMs run on the same host, so latency is low and throughput is high on the private network.
If I run iperf using TCP with 128 KB writes through the private network between the router nodes (with the iperf server/client running on the router hosts themselves), I get something like ~25 Gbit/s on the "bare" private network with no tunneling.
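For reference, the baseline test is basically this (the private-network IP is a placeholder):

# server on one router VM
iperf3 -s
# client on the other router VM, 128 KB writes over the private network
iperf3 -c 10.10.10.2 -l 128K -t 30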
That's obviously not representative of something that has to do encryption and tunneling, so I set up stunnel between the two router nodes to see what tunneling the same traffic over TLS would look like on this network. Sending the iperf traffic through the stunnel, I get a little under 5 Gbit/s.
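The stunnel setup is nothing fancy. On the rx side the config is roughly:

cert = /etc/stunnel/stunnel.pem
key = /etc/stunnel/stunnel.key
[iperf]
accept = 0.0.0.0:15201
connect = 127.0.0.1:5201

and on the tx side:

client = yes
[iperf]
accept = 127.0.0.1:5201
connect = 10.10.10.2:15201

(cert paths, ports and IPs are placeholders), with the iperf client pointed at the local accept port.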
At this point, I figure that if I send traffic through the same network between two OpenZiti routers, I should see performance somewhere in that ballpark since, ultimately, the routers in tproxy mode should be doing something similar to stunnel: proxying my data over a TLS link between each other.
However, when I set up an OpenZiti service, I get only about 700 Mbit/s. Asking iperf to use 3 parallel connections gets me up to ~1 Gbit/s combined, and additional connections don't raise the combined throughput; they just get a diminishing share of that ~1 Gbit/s as more connections are added.
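Concretely, the test through the overlay is just iperf pointed at the intercepted service address ("iperf.ziti" is a placeholder for whatever the intercept.v1 config defines):

# single connection
iperf3 -c iperf.ziti -l 128K -t 30
# three parallel connections
iperf3 -c iperf.ziti -P 3 -l 128K -t 30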
So I took a look at some of the discourse threads and applied the suggested sysctl settings to raise the system socket buffer sizes to 16 MB. That didn't seem to do anything. I also saw that "realistic latency" might be important, so I used tc qdisc on the nodes to add 10 ms of latency in each direction between the routers (for an RTT of 20 ms). That did nothing but lower the overall performance even further. To be fair, it also lowered the "bare" network and stunnel numbers, but I think those can be recovered by asking iperf to raise its socket buffer size.
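For reference, this is roughly what I applied (interface name and IP are placeholders):

# raise the socket buffer maximums to 16 MB
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 131072 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 16384 16777216"

# add 10 ms each way on the private NIC of each router
tc qdisc add dev ens224 root netem delay 10ms

# recovering the bare/stunnel numbers at 20 ms RTT by forcing a bigger socket buffer
iperf3 -c 10.10.10.2 -l 128K -w 8M -t 30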
I also saw that there are buffer adjustments I can configure on the routers in OpenZiti, like the txPortal* and rxBufferSize settings. I raised these to basically the sysctl max (16 MB) and increased txBuffers to 128. I added them to all the "options:" sections in the ziti-router config YAML and restarted the routers. If anything, these adjustments just seemed to make performance worse.
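For reference, this is roughly what one of the edited sections looks like (I'm typing the option names from memory, so the exact keys may be slightly off):

listeners:
  - binding: tunnel
    options:
      mode: tproxy
      txPortalStartSize: 16777216
      txPortalMaxSize: 16777216
      rxBufferSize: 16777216
      # plus the tx buffer-count option mentioned above, set to 128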
Looking at CPU, with one connection through OpenZiti the ziti-router process uses a little over 200% CPU according to top. If I push the number of connections up to reach 1 Gbit/s, it uses ~350% CPU on the tx side and ~500% CPU on the rx side. Memory usage is low. So nothing is starved for CPU or RAM, as far as I can tell.
Digging further, I turned on debug logging and when I ran iperf I noticed a lot of this sort of logging on the router doing the tx:
Nov 20 18:23:34 ubuntuguest ziti[14825]: {"_context":"{c/wU281QBQ6|@/ZVEK}\u003cInitiator\u003e","circuitId":"wU281QBQ6","file":"github.com/openziti/ziti/router/xgress/xgress.go:799","func":"github.com/openziti/ziti/router/xgress.(*Xgress).sendUnchunkedBuffer","level":"debug","msg":"forwarded [10.3 kB]","origin":0,"seq":1829,"time":"2024-11-20T18:23:34.083Z"}
And on the rx router side I'm seeing:
Nov 20 18:23:36 ubuntuguest ziti[14356]: {"_context":"{c/wU281QBQ6|@/ZaN4}\u003cTerminator\u003e","circuitId":"wU281QBQ6","file":"github.com/openziti/ziti/router/xgress/xgress.go:453","func":"github.com/openziti/ziti/router/xgress.(*Xgress).tx.func2","level":"debug","msg":"payload 2669 of size 10257 removed from rx buffer, new size: 0","origin":0,"seq":2669,"time":"2024-11-20T18:23:36.512Z"}
Is the router really ingesting the traffic 10 KB at a time and sending 10 KB payloads over the overlay? If so, I could see the tremendous syscall overhead of doing that being a problem. At the very least, I would expect it to converge on a buffer size closer to the 128 KB that iperf is sending.
I'm no golang expert, but looking at the source I see the default "Mtu" setting is "0", which seems to make the router read as much as it can from the underlying connection, right? So is this a Go I/O library issue where it only hands the router code ~10 KB at a time from this continuous incoming stream, or is the router code only asking for such a tiny buffer on its own somehow?
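Just to make the syscall-overhead concern concrete, here's a toy Go sketch (not the actual xgress code, purely an illustration) of how much the chunk size matters in a copy loop:

package main

import (
    "fmt"
    "io"
    "strings"
)

// copyChunks copies src to dst using a fixed-size buffer and counts how many
// Read calls (i.e. read syscalls on a real socket) it takes.
func copyChunks(dst io.Writer, src io.Reader, chunk int) int {
    buf := make([]byte, chunk)
    calls := 0
    for {
        n, err := src.Read(buf)
        calls++
        if n > 0 {
            dst.Write(buf[:n])
        }
        if err != nil {
            return calls
        }
    }
}

func main() {
    data := strings.Repeat("x", 10*1024*1024) // a 10 MB "stream"
    for _, chunk := range []int{10 * 1024, 128 * 1024} {
        calls := copyChunks(io.Discard, strings.NewReader(data), chunk)
        fmt.Printf("chunk=%6d reads=%d\n", chunk, calls)
    }
}

Moving the same 10 MB in 10 KB chunks takes roughly 13x the read/write calls of 128 KB chunks, which is why the 10 kB number in the debug logs caught my eye.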
Digging deeper with strace on the Tx side I also see this sort of thing going on:
1227 1732131703.617182 read(13, <unfinished ...>
1226 1732131703.617214 nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
1227 1732131703.617241 <... read resumed>""..., 3456) = 1257
1227 1732131703.617301 futex(0x40000a7648, FUTEX_WAKE_PRIVATE, 1) = 1
1229 1732131703.617355 <... futex resumed>) = 0
1226 1732131703.617363 <... nanosleep resumed>NULL) = 0
1229 1732131703.617387 epoll_pwait(4, <unfinished ...>
1226 1732131703.617394 nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
1229 1732131703.617408 <... epoll_pwait resumed>, 128, 0, NULL, 0) = 0
1227 1732131703.617416 read(15, <unfinished ...>
1229 1732131703.617454 futex(0x40004a8b48, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
1227 1732131703.617464 <... read resumed>""..., 10240) = 10240
1229 1732131703.617486 <... futex resumed>) = 1
1246 1732131703.617539 <... futex resumed>) = 0
1227 1732131703.617558 write(12, ""..., 10342 <unfinished ...>
Which looks like: read ~1.2 KB of data on one fd, wait 20 µs, then read 10 KB on another fd after a further (locking-related?) delay, and then write out a little over 10 KB. I think the 20 µs sleep may be a golang runtime polling thing, but with reads this small it means 2x syscall overhead plus multiple 20 µs poll delays for every logical read from the iperf client. If larger reads were happening, the 20 µs polling might actually be fine, but with such small reads it's just more overhead?
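For reference, the trace above came from something along the lines of this, attached to the tx-side router process:

strace -f -ttt -e trace=read,write,futex,nanosleep,epoll_pwait -p <tx-router-pid>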
After looking at all of this I also stumbled on this:
Which seems to say that the best I can expect out of an 8-CPU setup is something like 500 Mbit/s of combined throughput? Is that currently true? Am I barking up the wrong tree trying to get more than 500 Mbit/s of throughput out of OpenZiti, or did I just not configure larger IO settings properly?
Any help here would be appreciated!
Thanks!