I have some routers in my infrastructure that communicate with each other across the world. One of the things I am seeing is failures to send heartbeats. I get messages like this:
```
Aug 29 17:25:35 storziti02 ziti[2283277]: [235535.755] ERROR channel/v2.(*heartbeater).sendHeartbeatIfQueueFree: {channelId=[ch{l/4eQTK7E7oW4sJXBXHf4Ivr}->u{classic}->i{PWoP}] error=[timeout waiting to put message in send queue: context deadline exceeded]} handleUnresponded failed to send heartbeat
Aug 29 17:26:46 storziti02 ziti[2283277]: [235606.057] ERROR channel/v2.(*heartbeater).sendHeartbeatIfQueueFree: {error=[timeout waiting to put message in send queue: context deadline exceeded] channelId=[ch{l/4eQTK7E7oW4sJXBXHf4Ivr}->u{classic}->i{PWoP}]} handleUnresponded failed to send heartbeat
```
Is there a timeout that I need to adjust to make sure this works better?
I am also thinking it would be a good idea to change the protocol used on the underlay network to UDP, to avoid the issues that latency causes for TCP. Right now I am sending TCP traffic inside a tunnel that itself runs over TCP, and the distance makes for pretty poor performance. Is this simply a matter of changing the "bind" and "advertise" lines on the link listener to udp, and making sure any firewalls (on-prem), network ACLs, and security groups (cloud) are adjusted accordingly?
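For reference, the link listener section of my router config currently looks roughly like this (hostname and port are placeholders, not my real values):

```
link:
  dialers:
    - binding: transport
  listeners:
    - binding: transport
      bind: tls:0.0.0.0:10080
      advertise: tls:router.example.com:10080
```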
Is setting up a router in between an option? E.g. if you're communicating between Asia and the U.S., setting up a router somewhere in the middle, such as Frankfurt or Amsterdam, would be an option.
Nevertheless, I'm sure there must be a timeout that can be adjusted.
The problem with setting up a router in the middle is that it would create outbound traffic from the cloud, which would cost money. My use case is data migration. With no router in the middle, the data goes from on-prem into the cloud, and inbound traffic does not have a per-GB cost.
The failure to put a message in the send queue means that there are too many outstanding messages already. I would make sure that link is actually up (the l/XXX part indicates the link id). The timeout on the messages is a few seconds, so even global-scale routing isn't a problem for raw latency; we have many networks that operate that way all the time.
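For example, you can check link state from the controller with the ziti CLI and confirm the link id from the log message shows up and is connected (assuming you have fabric admin access):

```
# list the links the controller knows about, then look for
# the id from the log line (l/4eQTK7E7oW4sJXBXHf4Ivr)
ziti fabric list links
```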
We do occasionally see a lot of these failures across different links; as long as they are not constant, it is normal. Heartbeats are sent all the time, so it can appear to be a real issue when it isn't. There can be a large number of failures, but since we don't report successes, it's hard to tell the actual rate.
There is an ongoing effort to offer a DTLS (UDP) option for the link transport, but it is still in development, so stay tuned. (It's under active development: the code is committed and we're testing and tweaking it.) Until then, if you change the listener to udp, it will fail.
You can certainly modify the TCP stack settings on your nodes to improve performance in the meantime. We use the following in a sysctl file to boost performance by giving the stack more resources; on global-scale networks, the window sizes are especially critical.
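Something along these lines (illustrative values, not a prescription; size the buffers to your own memory and bandwidth-delay product):

```
# /etc/sysctl.d/99-tcp-tuning.conf
# Illustrative values -- adjust to available memory and path characteristics.

# Raise the maximum socket buffer sizes (64 MB here)
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864

# Let TCP autotuning grow windows large enough for high-latency paths
# (min, default, max in bytes)
net.ipv4.tcp_rmem = 4096 131072 67108864
net.ipv4.tcp_wmem = 4096 131072 67108864

# Probe path MTU so large windows don't get stuck on blackholed fragments
net.ipv4.tcp_mtu_probing = 1

# Fair queuing plus BBR tends to behave better than CUBIC on long, high-latency paths
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
```

Drop a file like that into /etc/sysctl.d/ and apply it with `sysctl --system`.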
Thanks for the recommended tunings, Mike. They helped a good bit! I am still interested in trying DTLS. I believe I saw something in recent release notes suggesting it was released in version 1.1.9, but when I try it, I get timeouts on the dialer. Are there any suggestions on using DTLS beyond just changing the advertise and bind lines from tls to dtls? I am pretty sure firewalls are not an issue in my config, because I was testing in AWS and I allowed all traffic from the specified list of IPs that are my routers.
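In case it matters, the change I made was just swapping the scheme on the link listener, roughly like this (again with placeholder hostname and port):

```
link:
  listeners:
    - binding: transport
      bind: dtls:0.0.0.0:10080
      advertise: dtls:router.example.com:10080
```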
We've done testing with DTLS to verify it works and to establish a performance baseline, but we're not using it in production. Using DTLS and TLS links on the same circuit hasn't been tested, so it's likely there could be issues there.