I have some routers in my infrastructure that communicate with each other across the world. One of the things I am seeing is failures to send heartbeats. I get messages like this:
```
Aug 29 17:25:35 storziti02 ziti[2283277]: [235535.755] ERROR channel/v2.(*heartbeater).sendHeartbeatIfQueueFree: {channelId=[ch{l/4eQTK7E7oW4sJXBXHf4Ivr}->u{classic}->i{PWoP}] error=[timeout waiting to put message in send queue: context deadline exceeded]} handleUnresponded failed to send heartbeat
Aug 29 17:26:46 storziti02 ziti[2283277]: [235606.057] ERROR channel/v2.(*heartbeater).sendHeartbeatIfQueueFree: {error=[timeout waiting to put message in send queue: context deadline exceeded] channelId=[ch{l/4eQTK7E7oW4sJXBXHf4Ivr}->u{classic}->i{PWoP}]} handleUnresponded failed to send heartbeat
```
Is there a timeout that I need to adjust to make sure this works better?
I am also thinking it would be a good idea to change the protocol used on the underlay network to UDP, to avoid the issues that latency causes for TCP. Right now I am sending TCP traffic inside a tunnel that itself runs over TCP, and the distance makes for pretty poor performance. Is this simply a matter of changing the "bind" and "advertise" lines on the link listener to udp, and making sure any firewalls (on-prem), network ACLs, and security groups (cloud) are adjusted accordingly?
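For reference, the link listener section of my router config currently looks roughly like this (hostname and port are placeholders, not my real values):

```
link:
  dialers:
    - binding: transport
  listeners:
    - binding: transport
      bind: tls:0.0.0.0:10080
      advertise: tls:router.example.com:10080
```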
Is setting up a router in between an option? E.g. if you're communicating between Asia and the U.S., setting up a router somewhere in the middle, such as Frankfurt or Amsterdam, would be an option.
Nevertheless, I'm sure there must be a timeout that can be adjusted.
The problem with setting up a router in the middle is that it would create outbound traffic from the cloud, which would cost money. My use case is data migration. With no router in the middle, the data goes from on-prem into the cloud, and inbound traffic does not have a per-GB cost.
The failure to put a message in the send queue means that there are too many outstanding messages already. I would make sure that link is actually up (the l/XXX part indicates the link id). The timeout on the messages is a few seconds, so even global-scale routing isn't a problem for raw latency; we have many networks that operate that way all the time.
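For example, you can check link state from the controller with the ziti CLI and confirm the link id from the log message shows up and is connected (assuming you have fabric admin access):

```
# list the links the controller knows about, then look for
# the id from the log line (l/4eQTK7E7oW4sJXBXHf4Ivr)
ziti fabric list links
```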
We do occasionally see a lot of these failures across different links; as long as they are not constant, it is normal. Heartbeats are sent all the time, so it can appear to be a real issue when it isn't. There can be a large number of failures, but since we don't report successes, it's hard to tell the actual rate.
There is an ongoing effort to offer a DTLS (UDP) option for the link transport, but it is still in development, so stay tuned. (It's under active development: the code is committed and we're testing and tweaking it.) Until then, if you change the listener to udp, it will fail.
You can certainly modify the TCP stack settings on your nodes to improve performance in the meantime. We use the following in a sysctl file to boost performance by giving the stack more resources; on global-scale networks, the window sizes are especially critical.
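Something along these lines (illustrative values, not a prescription; size the buffers to your own memory and bandwidth-delay product):

```
# /etc/sysctl.d/99-tcp-tuning.conf
# Illustrative values -- adjust to available memory and path characteristics.

# Raise the maximum socket buffer sizes (64 MB here)
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864

# Let TCP autotuning grow windows large enough for high-latency paths
# (min, default, max in bytes)
net.ipv4.tcp_rmem = 4096 131072 67108864
net.ipv4.tcp_wmem = 4096 131072 67108864

# Probe path MTU so large windows don't get stuck on blackholed fragments
net.ipv4.tcp_mtu_probing = 1

# Fair queuing plus BBR tends to behave better than CUBIC on long, high-latency paths
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
```

Drop a file like that into /etc/sysctl.d/ and apply it with `sysctl --system`.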
Thanks for the recommended tunings, Mike. They helped a good bit! I am still interested in trying DTLS. I believe I saw something in recent release notes suggesting it was released in version 1.1.9, but when I try it, I get timeouts on the dialer. Are there any suggestions on using DTLS beyond just changing the advertise and bind lines from tls to dtls? I am pretty sure firewalls are not an issue in my config, because I was testing in AWS and I allowed all traffic from the specified list of IPs that are my routers.
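In case it matters, the change I made was just swapping the scheme on the link listener, roughly like this (again with placeholder hostname and port):

```
link:
  listeners:
    - binding: transport
      bind: dtls:0.0.0.0:10080
      advertise: dtls:router.example.com:10080
```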
We've done testing with DTLS to verify it works and to establish a performance baseline, but we're not using it in production. Using DTLS and TLS links on the same circuit hasn't been tested, so it's likely there could be issues there.