Greetings,
I have some routers in my infrastructure which communicate across the world. One of the things I am seeing is a failure to send heartbeats. I get messages like this:
Aug 29 17:25:35 storziti02 ziti[2283277]: [235535.755] ERROR channel/v2.(*heartbeater).sendHeartbeatIfQueueFree: {channelId=[ch{l/4eQTK7E7oW4sJXBXHf4Ivr}->u{classic}->i{PWoP}] error=[timeout waiting to put message in send queue: context deadline exceeded]} handleUnresponded failed to send heartbeat
Aug 29 17:26:46 storziti02 ziti[2283277]: [235606.057] ERROR channel/v2.(*heartbeater).sendHeartbeatIfQueueFree: {error=[timeout waiting to put message in send queue: context deadline exceeded] channelId=[ch{l/4eQTK7E7oW4sJXBXHf4Ivr}->u{classic}->i{PWoP}]} handleUnresponded failed to send heartbeat
Is there a timeout that I need to adjust to make sure this works better?
I am also thinking it would be a good idea to change the protocol used on the underlay network to UDP, to avoid the problems that latency causes for TCP. Right now, I think I am sending TCP traffic inside a tunnel that itself runs over TCP, and the distance makes for pretty poor performance. Is this simply a matter of changing the "bind" and "advertise" lines on the link listener to udp, and making sure any firewalls (on-prem), network ACLs, and security groups (cloud) are adjusted accordingly?
Thanks!
Is setting up a router in between an option? For example, if you're communicating between Asia and the U.S., you could set up a router in Frankfurt or Amsterdam.
Nevertheless, I'm sure there must be a timeout that can be set.
The problem with setting up a router in the middle is that it would create outbound traffic from the cloud, which would cost money. My use case is data migration. With no router in the middle, the data goes from on-prem into the cloud, which does not have a per-GB cost.
A couple of items here to unpack.
- The failure to put a message in the send queue means that there are already too many outstanding messages. I would make sure that the link is actually up (the l/XXX in the channelId is the link id). The timeout on these messages is a few seconds, so raw latency isn't a problem even for global-scale routing; we have many networks that operate that way all the time.
- We occasionally see a lot of these failures across different links. If they are not constant, it is normal. Heartbeats are sent all the time, so it can look like a real issue when it isn't. The number of errors can be large, but since we don't report successes, it's hard to tell a failure rate.
- There is ongoing work to offer a DTLS (UDP) transport option, but it is still in development, so stay tuned. (It is in active development: it's in a commit, and we're testing and tweaking it.) Until then, if you change the listener to UDP, it will fail.
- You can certainly modify the stack settings on your nodes to improve TCP performance in the meantime. We use the following in a sysctl file to boost performance by giving the stack more resources. On global-scale networks especially, the window sizes are critical.
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_mem = 8388608 8388608 16777216
net.ipv4.udp_mem = 8388608 8388608 16777216
net.ipv4.tcp_retries2 = 8
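
To show why the window sizes matter so much on long links, here is a rough Python sketch of the bandwidth-delay product arithmetic. The function names and the sample bandwidth and RTT figures are assumptions made up for illustration, not Ziti settings, and the math ignores packet loss and the kernel's own buffer overhead, so treat the results as upper bounds.

```python
# Rough bandwidth-delay product (BDP) arithmetic. Function names and the
# sample bandwidth/RTT figures are illustrative assumptions, not Ziti settings.

TCP_BUFFER_MAX = 16777216  # max value from tcp_rmem / tcp_wmem above (16 MiB)


def bdp_bytes(bandwidth_mbps: float, rtt_ms: float) -> int:
    """Bytes that must be in flight to keep a path of this speed and RTT full."""
    # 1 Mbit/s sustained for 1 ms is 1000 bits, i.e. 125 bytes.
    return round(bandwidth_mbps * rtt_ms * 125)


def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound for one TCP connection: at most one window per round trip."""
    return window_bytes / (rtt_ms * 125)


if __name__ == "__main__":
    for rtt_ms in (20, 100, 250):
        small = max_throughput_mbps(65535, rtt_ms)
        large = max_throughput_mbps(TCP_BUFFER_MAX, rtt_ms)
        print(
            f"RTT {rtt_ms:>3} ms: 64 KiB window <= {small:7.1f} Mbit/s, "
            f"16 MiB window <= {large:7.1f} Mbit/s, "
            f"1 Gbit/s needs {bdp_bytes(1000, rtt_ms)} bytes in flight"
        )
```

Under these assumptions, a 64 KiB window at 250 ms RTT tops out at roughly 2 Mbit/s per connection, while a 16 MiB window allows over 500 Mbit/s in theory. The usable window is normally smaller than the configured buffer size, so real numbers will be lower.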