Tuning suggestions for Overlay over a distance

Greetings,

I have some routers in my infrastructure that communicate across the world. One of the things I am seeing is failures to send heartbeats. I get messages like this:

Aug 29 17:25:35 storziti02 ziti[2283277]: [235535.755] ERROR channel/v2.(*heartbeater).sendHeartbeatIfQueueFree: {channelId=[ch{l/4eQTK7E7oW4sJXBXHf4Ivr}->u{classic}->i{PWoP}] error=[timeout waiting to put message in send queue: context deadline exceeded]} handleUnresponded failed to send heartbeat
Aug 29 17:26:46 storziti02 ziti[2283277]: [235606.057] ERROR channel/v2.(*heartbeater).sendHeartbeatIfQueueFree: {error=[timeout waiting to put message in send queue: context deadline exceeded] channelId=[ch{l/4eQTK7E7oW4sJXBXHf4Ivr}->u{classic}->i{PWoP}]} handleUnresponded failed to send heartbeat

Is there a timeout that I need to adjust to make sure this works better?

I am also thinking it would be a good idea to change the protocol used on the underlay network to UDP, to avoid the issues that latency causes for TCP. Right now I am sending TCP traffic inside a tunnel that itself runs over TCP, and over this distance that makes for pretty poor performance. Is this simply a matter of changing the "bind" and "advertise" lines on the link listener to udp, and making sure any firewalls (on-prem), network ACLs, and security groups (cloud) are adjusted accordingly?
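For reference, the link listener I'd be changing looks roughly like this today (hostname and port here are placeholders rather than my real values):

link:
  dialers:
    - binding: transport
  listeners:
    - binding: transport
      bind: tls:0.0.0.0:10080
      advertise: tls:router.example.com:10080

so I'm assuming the change would just be swapping the tls: scheme on the bind and advertise addresses.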

Thanks!

Is setting up a router in between an option? E.g. if you're communicating between Asia and the U.S., a router in Frankfurt or Amsterdam could help.
Nevertheless, I'm sure there must be a timeout that can be adjusted.

The problem with setting up a router in the middle is that it would create outbound traffic from the cloud, which would cost money. My use case is data migration. With no router in the middle, the data goes from on-prem into the cloud, which does not incur a per-GB cost.

A couple of items here to unpack.

  1. The failure to put a message in the send queue means there are already too many outstanding messages. I would make sure that link is actually up (the l/XXX prefix is the link id). The timeout on these messages is a few seconds, so even global-scale routing isn't a problem for raw latency; we have many networks that operate that way all the time.
  2. Occasional failures across different links are normal; as long as they are not constant, they aren't a concern. Heartbeats are sent all the time, so the errors can make it look like a real issue when it isn't. The absolute count can be large, but since we don't report successes, it's hard to judge a failure rate from the log alone.
  3. There is ongoing work to offer a DTLS (UDP) option for link transport, but it is still in active development (it's committed and we're testing and tweaking it), so stay tuned. Until then, if you change the listener, it will fail.
  4. In the meantime, you can certainly modify the stack settings on your nodes to improve TCP performance. We use the following in a sysctl file to give the stack more resources; on global-scale networks the window sizes in particular are critical.
# Socket buffer defaults and send-buffer ceiling, in bytes
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
# Per-socket TCP receive/send buffers: min, default, max (bytes); the max drives window autotuning
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Overall TCP/UDP memory limits: low, pressure, high (in pages)
net.ipv4.tcp_mem = 8388608 8388608 16777216
net.ipv4.udp_mem = 8388608 8388608 16777216
# Retransmit established-connection data at most 8 times (default 15) so dead peers are detected sooner
net.ipv4.tcp_retries2 = 8
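
To apply these without a reboot, drop them into a file under /etc/sysctl.d/ (the filename below is just an example) and reload:

# settings above saved as /etc/sysctl.d/99-ziti-tuning.conf
sudo sysctl --system              # reload all sysctl configuration files
sysctl net.ipv4.tcp_rmem          # spot-check that a value took effect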

Thanks for the recommended tunings, Mike. They helped a good bit! I am still interested in trying DTLS. Something in the recent release notes makes me think it shipped in version 1.1.9, but when I try it, I get timeouts on the dialer. Are there any suggestions for using DTLS beyond changing the advertise and bind lines from tls to dtls? I am pretty sure firewalls are not an issue in my setup, because I was testing in AWS and allowed all traffic from the specific IPs of my routers.
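
Concretely, the only thing I changed was the scheme on the link listener's bind and advertise addresses, roughly like this (addresses are placeholders for my real ones):

      bind: dtls:0.0.0.0:10080
      advertise: dtls:router.example.com:10080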

Hi @greggw01

We've done testing with DTLS to verify it works and to establish a performance baseline, but we're not using it in production. Using DTLS and TLS links on the same circuit hasn't been tested, so it's likely there could be issues there.

Having said that, feel free to experiment with it :slight_smile:

Here are the config templates I was using for some DTLS testing: ziti/zititest/models/dtls-west/configs at main · openziti/ziti · GitHub

The main thing you're probably missing is setting the MTU for xgress.

dialers:
  - binding: tunnel
    options:
      mtu: 1435
listeners:
  - binding: tunnel
    options:
      mode: tproxy
      mtu: 1435

We haven't implemented MTU discovery for the DTLS layer, so you'll have to find a setting that works. The setting above worked in testing on AWS.
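
If you want to sanity-check what the underlying path will actually carry, pinging with the don't-fragment flag is a quick way to probe it; the xgress MTU then needs to sit comfortably below that, since the IP, UDP and DTLS headers plus channel framing all take a share of each datagram (which is why 1435 is a starting point rather than a rule):

# 1472 bytes of ICMP payload + 28 bytes of ICMP/IP headers = a full 1500-byte packet
ping -M do -s 1472 <remote-router-ip>
# if that fails with "message too long", lower -s until it goes through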

Paul