Performance question regarding Edge routers behind a NAT

Greetings!

I am working on a proof of concept for accessing data stored in the cloud and cached on-prem, and was hoping to use OpenZiti to connect the two halves. In all of my previous deployments, the on-prem routers reach out to AWS from a network with no inbound firewall holes; however, their IPs are publicly routable. In a new environment that I just stood up, the on-prem routers are behind a NAT, using RFC 1918 (10.0.0.0/8) addresses. The issue I am having is extreme performance degradation.

Is this likely to be happening because my new edge routers are reaching out from behind a NAT? The basic connectivity is there, but when I start to move data, the "Latency" displayed by the CLI when I run 'ziti fabric list links' goes way up for links involving the new on-prem site.

I decided to use iperf to test the connections by borrowing one router from the site (stopping the ziti process) and using it to run iperf, with a static route set up on that host that uses the site's other OpenZiti edge router to reach a particular host in the cloud. I ran the same test at two different sites: an existing site, near the new site, gave me between 200 and 500 Mbits/sec across the country, while the new site gives me less than 10 Mbits/sec. However, if I open port 443 in AWS and run iperf through it, from one of the routers behind the NAT to the EIP of the target host (not going through OpenZiti), my results are again in the hundreds of megabits per second, even from the site behind the NAT.

At both sites, I am running sysctl.conf entries as follows (thanks again for those! They provided a good boost):

net.ipv4.conf.all.forwarding = 1
net.ipv4.ip_forward = 1
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.default.accept_source_route = 0
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_mem = 8388608 8388608 16777216
net.ipv4.udp_mem = 8388608 8388608 16777216
net.ipv4.tcp_retries2 = 8
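
For reference, the iperf test described above looked roughly like the following (iperf3-style invocation; the target IP and the next-hop edge router address are placeholders):

# on the cloud-side target host
iperf3 -s

# on the borrowed on-prem router, with the ziti process stopped:
# route the target via the site's other OpenZiti edge router, then run the client for ~20 seconds
sudo ip route add 203.0.113.10/32 via 10.0.0.2
iperf3 -c 203.0.113.10 -t 20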

The router version I am running is 1.1.15, as I have not had a chance to qualify newer versions to make sure they work with my automated deployment process.

Are you aware of anything off the top of your heads that I might be missing? Are there specific things I should look for in the logs to tell me if I can tune my way out of this issue? I am pretty much running a stock config that gets generated by the CLI command createEdgeRouterConfig, with minor tweaks like dialer/listener groups being added by a script. Does the output of that command change often between versions? Should I be downloading a new ziti-cli-functions.sh often?
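
(For what it's worth, my understanding is that createEdgeRouterConfig in ziti-cli-functions.sh wraps the ziti CLI's config generator, so the rough direct equivalent would be something like the line below; the router name is a placeholder, and flag behavior may differ between CLI versions.)

ziti create config router edge --routerName my-edge-router > my-edge-router.yaml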

Thanks!

A few questions just to understand the testing setup better.

  1. How long are you running the test to see the throughput?
  2. TCP or UDP?
  3. Are you using iperf in all cases?
  4. Are both end (initiating and terminating) edge routers behind NAT or not routable?
  5. You said you're running 1.1.15 routers, are all nodes 1.1.15?

This is certainly not expected; we regularly see single-flow throughput in the hundreds of Mbps in all environments.

There are some metrics which might be helpful in understanding this. If you are not already collecting them, open up the metrics filters for the xgress and link metrics. You can create or add the values to the controller's configuration, and then read them historically. It's better if you have something to process them with, like Elastic or InfluxDB, or you can use the Prometheus exporter if that's more comfortable.

events:

  jsonLogger:
    subscriptions:
      - type: metrics
        metricFilter: "link.*"
      - type: metrics
        metricFilter: "xgress.*"
    handler:
      type: file
      format: json
      path: /var/log/ziti/utilization-metrics.log
      maxsizemb: 1024
      maxbackups: 1

If you are not already familiar, OpenZiti has its own flow control mechanism called xgress, which is a windowed flow control analogous to TCP in many ways. These metrics will show you if there are drops, blocked windows, etc. to get an understanding of what might be happening to slow the throughput.
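
Once the file handler is writing, you can eyeball the raw events with something like the following (a rough sketch; it assumes each JSON line carries a metric name field, and the exact field names can differ slightly by version):

# pull recent xgress metric events out of the JSON log
jq -c 'select((.metric // "") | startswith("xgress"))' /var/log/ziti/utilization-metrics.log | tail -n 20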

You should also check the fabric.circuit events to see the path information. If both routers are behind NAT or otherwise not directly addressable, then whatever is addressable (probably the router co-located with the controller) must be acting as a node in the path, so all the optimizations would need to be applied there as well. The fabric.circuit messages will show the nodes in the path so you can verify that. There were changes made to the xgress flow control settings in 1.1.9, hence the question about all the nodes and their versions. If the publicly addressable node is on an older version, it could still have the previous settings, which shouldn't cause throughput this low, but certainly wouldn't help either.
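
If circuit events are not already in your subscriptions, adding them alongside the metrics entries is enough; the type name below follows the older fabric.* convention, and newer controllers may expect a shortened name, so check the event documentation for your version:

    subscriptions:
      - type: fabric.circuits
      - type: metrics
        metricFilter: "link.*"
      - type: metrics
        metricFilter: "xgress.*"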

Hi Mike, and thank you for your reply. I have asked the person who runs our controller to add the configuration to gather xgress metrics; unfortunately, I don't have access to the controller myself. While not every router in our infrastructure is on 1.1.15, I can say that every router involved in these communications should be, because I am using link/dialer groups to prevent routers from talking to one another when I don't specifically want them to.

My generated config file also came with some stuff commented out at the bottom, which leads me to questions about transports. I have seen limited material in the docs regarding transwarp, and I was wondering how to go about enabling it. Do you have, by any chance, a fully annotated config file that lists all of the options and their min, max, and default values? When I look at the docs on the website, I see some things defined like this: maxRetryInterval - (optional, default 1h, min 10ms, max 24h), which is totally clear to me. Some other things are defined like this: rxBufferSize - (optional, 41024 1024). Is that the default value, or the max value? I tend to send a lot of data over long distances, and I am looking for every knob I can turn, and a rough idea of how to turn it, to make that faster. I am almost certain that I want to do the xgress equivalent of making the TCP window size larger.

I will see if I can get metrics from the controller. At one time, we tried polling them with the Prometheus endpoint, and they would work for a bit and then just stop. I know that my counterpart who runs the controller recently upgraded it, so maybe I should revisit whether the metrics collection stays working.

Thanks!

Sorry. Got off on a tangent, and didn't manage to answer any of the questions:

  1. The iperf test runs for about 20 seconds
  2. TCP
  3. I use iperf to get a rough idea of the speed, and in this case it confirmed what I was seeing on the storage appliance with the very slow transfer speeds.
  4. In the case of my problematic site, only the on-prem router is behind a NAT. The 4 public routers are in AWS with Elastic IPs, and the on-prem routers always reach out to the routers in AWS. Since the communication often goes in both directions, depending on the services, sometimes the on-prem router is initiating and sometimes it is terminating.
  5. 1.1.15 for all of the routers involved in this communication, but we do have a few 1.1.5 and 1.1.8 routers on the fabric, which I would certainly be willing to upgrade if necessary.

Thanks!

Part of the question regarding the version is that the xgress option defaults were updated in 1.1.9 based on a bunch of performance testing we did, making them much more aggressive towards throughput. The xgress option definitions are available here.
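
As a rough illustration only (the option names come from those definitions, the values below are placeholders rather than recommendations, and the available knobs can vary by version), the xgress options are set under a listener's options block in the router config:

listeners:
  - binding: edge
    address: tls:0.0.0.0:443
    options:
      # placeholder values; verify the names and defaults against the xgress
      # option definitions for your router version before copying anything
      txPortalStartSize: 16384
      txPortalMaxSize: 4194304
      rxBufferSize: 4194304
      retxStartMs: 200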

I would check the events to be sure it is taking the path you think it is. The paths are selected by latency by default, and sometimes it just doesn't go like you think it should, especially when you are relatively new to ziti. If one of those 1.1.5 routers is involved, it could be slowing the process. Again, that's not enough to explain 10Mbps, but it wouldn't help, either. Those records will give you a second look to verify the link/dialer groups are working as intended, etc. as well. In the case you describe, you should have a direct shot between routers, but it never hurts to check.

You can run sudo ziti agent router dump-links to see the links on a router. Unfortunately, the output is in IDs, not names, so it still has to be translated, but that's a way to verify that you do, indeed, have a direct link, if there is any question.
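
For example (run the first command on the router host and the second from a machine with access to the controller; the link IDs in the two outputs have to be matched up by hand):

sudo ziti agent router dump-links
ziti fabric list links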

As to transwarp, I'm not sure much has been done with that in a while, nor how much of it was just adopted into the base project. It was a performance project a while back, and I just don't know. @michael.quigley or @plorenz may be able to give more information on that part.

You can enable transwarp with the appropriate address prefixing. So where you've got tls:<host>:<port>, replace that with transwarptls:<host>:<port>, both for listeners and dialers.
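
In the generated router config, that would look something like the sketch below (the hostname and port are placeholders, and the surrounding structure follows the stock link section; any dialer entries that carry tls: addresses would change the same way):

link:
  dialers:
    - binding: transport
  listeners:
    - binding: transport
      # swap the tls: prefix for transwarptls: on the bind and advertise addresses
      bind: transwarptls:0.0.0.0:10080
      advertise: transwarptls:router.example.com:10080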

Revamping the config system so we can emit consistent docs from the code is on my list, but I haven't gotten there yet. It's been coming up a lot recently, so I'll likely take a pass at it after my current effort is complete.

You can see the address parsers available here: ziti/ziti/main.go at main · openziti/ziti · GitHub
Note that unreliable transports should only be used for links, as edge and control channel connections assume reliable transports.

cheers,
Paul


Mike,
It looks like my stats gathered from the prometheus endpoint are currently working, and since I figured out the magic incantation in the drilldown menu of Grafana (I'm new to it, but it is awesome) to find xgress stats, I now have awesome data to share! I don't know if it will be awesome news, but seeing all of the data is pretty cool. I am confident that when I learn what to do with the data, I will be able to improve things throughout my infrastructure.

I will share all of the stats that look indicative of issues, and not talk about various byterate charts and other data that I would expect to spike during a test.

During my tests, I see spikes in the following:
ziti_link_dropped_msgs
ziti_link_dropped_rtx_msgs
ziti_xgress_ack_duplicates
ziti_xgress_blocked_by_local_window
ziti_xgress_blocked_by_remote_window
ziti_xgress_dropped_payloads
ziti_xgress_payload_duplicates
ziti_xgress_retransmissions

ziti_xgress_tx_unacked_payload_bytes
This graph is strange to me because it sits at a high non-zero number (about 16.75Mil), goes up during my test, and comes back down to 16.75Mil after I stop sending data.

ziti_xgress_tx_unacked_payloads
This graph had similar behavior, except the baseline seems to be always incrementing, with spikes during my test; it looks like a city on the side of a hill.

Which items do you suggest I start tweaking? I am guessing maybe waiting longer before retransmitting would lower ziti_xgress_retransmissions? Or is that something where, if the window is the right size, it just stops happening? Also, how do things like txPortalMaxSize relate to the sysctl.conf settings that open up the TCP window sizes? Will I shoot myself in the foot if I set things too large in the router config file versus the TCP settings at the OS level?
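
For reference, a rough sketch of the kind of Grafana query I'm using for these (simple rate() expressions over the scraped counters; metric and label names are as exported by the Prometheus endpoint and may differ by version):

rate(ziti_xgress_retransmissions[1m])
rate(ziti_link_dropped_msgs[1m])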

So, nothing surprising without numbers attached. All of those would indicate flow control working, but depending on the scale it could mean it is not working well overall. The unacked payload behavior is odd. The two should move in tandem, as one counts messages (akin to datagrams) and the other counts the volume of the same data, but I'm not sure why the baseline value would be 16MB.

If the links are dropping messages, they aren't able to keep up with the flows, and this could cause retransmissions, etc., like any infrastructure connection dropping packets. The ends could be doing it too, of course.

If you can get some of the graphs, and any data on the "good" transmissions, like overall link byte rates, maybe I can think of something. It certainly looks like the flow control is collapsing: something is causing back pressure and errors, slowing the process considerably.

Hi Mike!
I didn't see that there was a button to upload things last night. Let me attach what I have for graphs that I was able to find in Grafana.

[Grafana graph screenshots attached]

Also, regarding the previous question of whether or not my test was taking an undesirable path with routers in the middle, I don't think so. If I am understanding the CLI output correctly, it looks like the data goes from the initiating router, through one link, to the terminating router.

[greggw01@storziti01 ziti]$ ziti fabric list circuits 'limit none' | grep ttcp

│ qlnLgCk1z │ cma49hm8c6dhlmslbbxqgtlbg │ fsx.studios-ttcp.studios-ttcp.svc │ 2rcoSPGoC00RyTNYiU5wBV │ 2025-05-02 15:14:54 │ r/er-playa-fsx1 -> l/VTNAUdB3XED92IkeCLAS8 -> r/er-use1-studios5-fsx1 │

When I change my query from source_id to sourceRouterId, I get link-related stats: