Reconnecting Tunneler on network errors

We're running quite a couple of Ziti tunnelers which we want to keep running also on network disconnects that are not caused by Ziti. Right now, we sometimes get situations where the tunneler gives up after some time which is not what we want. Instead, I'd like the tunneler to continue endlessly until it can reach controller and/or routers.
I'm sure there are some retry counters we need to adjust here. Which are those? Are there pitfalls?

Here's some of the errors we get, after which the tunnelers aren't connecting anymore:

Occasion 1:

Mai 02 01:48:32 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671243.133]   ERROR ziti-sdk:connect.c:963 connect_reply_cb() conn[0.281324/HqwtJ1uO/Connecting] failed to connect, reason=timeout waiting for message reply: context deadline exceeded
Mai 02 01:48:32 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671243.133]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
Mai 02 01:48:47 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671258.559]   ERROR ziti-sdk:connect.c:963 connect_reply_cb() conn[0.281326/8-vhI5J-/Connecting] failed to connect, reason=timeout waiting for message reply: context deadline exceeded
Mai 02 01:48:47 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671258.559]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
Mai 02 01:48:47 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671258.559]   ERROR ziti-sdk:connect.c:963 connect_reply_cb() conn[0.281325/7bPQmF9T/Connecting] failed to connect, reason=timeout waiting for message reply: context deadline exceeded
Mai 02 01:48:47 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671258.559]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
Mai 02 01:48:47 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671258.560]   ERROR ziti-sdk:connect.c:963 connect_reply_cb() conn[0.281330/inSbVX7l/Connecting] failed to connect, reason=timeout waiting for message reply: context deadline exceeded
Mai 02 01:48:47 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671258.560]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
Mai 02 01:48:47 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671258.560]   ERROR ziti-sdk:connect.c:963 connect_reply_cb() conn[0.281329/vOQN2wy0/Connecting] failed to connect, reason=timeout waiting for message reply: context deadline exceeded
Mai 02 01:48:47 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671258.560]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
Mai 02 01:48:47 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671258.560]   ERROR ziti-sdk:connect.c:963 connect_reply_cb() conn[0.281327/ArLoqy-O/Connecting] failed to connect, reason=timeout waiting for message reply: context deadline exceeded
Mai 02 01:48:47 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671258.560]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
Mai 02 01:48:47 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671258.560]   ERROR ziti-sdk:connect.c:963 connect_reply_cb() conn[0.281328/zTmUyXb1/Connecting] failed to connect, reason=timeout waiting for message reply: context deadline exceeded
Mai 02 01:48:47 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671258.560]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
Mai 02 01:48:49 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671260.142]   ERROR ziti-sdk:connect.c:963 connect_reply_cb() conn[0.281331/fAcvG1Q5/Connecting] failed to connect, reason=timeout waiting for message reply: context deadline exceeded
Mai 02 01:48:49 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671260.142]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
Mai 02 01:48:53 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671261.134]   ERROR ziti-sdk:connect.c:963 connect_reply_cb() conn[0.281332/ei8k2jI1/Connecting] failed to connect, reason=timeout waiting for message reply: context deadline exceeded
Mai 02 01:48:53 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671261.134]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
Mai 02 01:49:01 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671269.234]   ERROR tunnel-sdk:tunnel_tcp.c:212 tunneler_tcp_write() failed to tcp_write -11 (936, 936)
Mai 02 01:49:01 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671269.234]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:121 on_ziti_data() failed to write to client
Mai 02 01:49:01 srv1 ziti-edge-tunnel[3282449]: (3282449)[   671269.234]    WARN ziti-sdk:connect.c:820 flush_to_client() conn[0.281337/o68Hy9J0/Closed] client indicated error[-1] accepting data (0 bytes buffered)

Occasion 2:

May 02 03:12:10 srv2 ziti-edge-tunnel[385593]: (385593)[   797022.866]   ERROR ziti-sdk:connect.c:963 connect_reply_cb() conn[0.254261/Z0mhhwzq/Connecting] failed to connect, reason=exceeded maximum [2] retries creating >
May 02 03:12:10 srv2 ziti-edge-tunnel[385593]: (385593)[   797022.866]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
May 02 03:12:11 srv2 ziti-edge-tunnel[385593]: (385593)[   797023.937]   ERROR tunnel-sdk:tunnel_tcp.c:188 on_tcp_client_err() client=tcp:100.64.0.1:39049 err=-14, terminating connection
May 02 03:12:11 srv2 ziti-edge-tunnel[385593]: (385593)[   797023.945]   ERROR tunnel-sdk:tunnel_tcp.c:188 on_tcp_client_err() client=tcp:100.64.0.1:56069 err=-14, terminating connection
May 02 03:12:11 srv2 ziti-edge-tunnel[385593]: (385593)[   797023.946]    WARN ziti-sdk:channel.c:524 dispatch_message() ch[0] received message without conn_id or for unknown connection ct[ED72] conn_id[254263]
May 02 03:12:11 srv2 ziti-edge-tunnel[385593]: (385593)[   797023.948]    WARN ziti-sdk:channel.c:524 dispatch_message() ch[0] received message without conn_id or for unknown connection ct[ED72] conn_id[254262]
May 02 03:12:11 srv2 ziti-edge-tunnel[385593]: (385593)[   797023.949]   ERROR tunnel-sdk:tunnel_tcp.c:188 on_tcp_client_err() client=tcp:100.64.0.1:44757 err=-14, terminating connection

Hi and thanks for letting us know about this. The logs that you show are related to the tunneler trying to make connections for services, and those will fail if the connections to the ER(s) and/or controllers are failing. So these log messages are a showing the result of a symptom, and I'm guessing there's something else in your logs that helps reveal the symptom.

Which version are you running? Can you send the complete logs for one of the tunnelers that's in this condition?

Thanks!

@scareything and I are already in contact, however I wanted to share some of the logs here as well, in case someone has the same problem or can help.
I am very certain that the problem is caused somehow by a wildcard service that we have for our monitoring system, Zabbix. We have a service that allows access from the Zabbix server to *.ziti. We have some tunnelers that don't use this service and they are the only one's without the problem.
These are the last logs after which the ziti-edge-tunnel becomes "useless" until it's being restarted.

May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE tunnel-sdk:netif_shim.c:34 netif_shim_output() writing packet TCP[100.64.0.3:10051 -> 100.64.0.1:47923] len=83
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE ziti-sdk:connect.c:817 flush_to_client() conn[0.12215/jymz5xFN/Connected] client consumed 43 out of 43 bytes
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE tunnel-cbs:ziti_tunnel_cbs.c:111 on_ziti_data() got -19 bytes from ziti
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG tunnel-cbs:ziti_tunnel_cbs.c:126 on_ziti_data() ziti connection sent EOF (ziti_eof=0, tnlr_eof=0)
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG tunnel-sdk:ziti_tunnel.c:465 ziti_tunneler_close_write() closing write connection: client[tcp:100.64.0.1:47923] service[zabbix_agent.svc]
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG tunnel-sdk:tunnel_tcp.c:235 tunneler_tcp_close_write() closing write src[tcp:100.64.0.1:47923] dst[tcp:100.64.0.3:10051] state[4/ESTABLISHED] flags[0x100] service[zabbix_agent.svc]
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE tunnel-sdk:netif_shim.c:34 netif_shim_output() writing packet TCP[100.64.0.3:10051 -> 100.64.0.1:47923] len=40
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG tunnel-sdk:tunnel_tcp.c:241 tunneler_tcp_close_write() closed write src[tcp:100.64.0.1:47923] dst[tcp:100.64.0.3:10051] state[5/FIN_WAIT_1] flags[0x120] service[zabbix_agent.svc]
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE ziti-sdk:connect.c:798 flush_to_service() conn[0.12215/jymz5xFN/Connected] flushed 0 messages
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE ziti-sdk:connect.c:757 on_flush() conn[0.12215/jymz5xFN/Connected] stopping flusher
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE tunnel-sdk:netif_shim.c:58 netif_shim_input() received packet TCP[100.64.0.1:47923 -> 100.64.0.3:10051] len=40
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE tunnel-sdk:tunnel_tcp.c:366 recv_tcp() received segment src[tcp:100.64.0.1:47923] dst[tcp:100.64.0.3:10051] flags[ACK]
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE tunnel-sdk:netif_shim.c:62 netif_shim_input() done after reading 1 packets
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE tunnel-sdk:netif_shim.c:58 netif_shim_input() received packet TCP[100.64.0.1:47923 -> 100.64.0.3:10051] len=40
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE tunnel-sdk:tunnel_tcp.c:366 recv_tcp() received segment src[tcp:100.64.0.1:47923] dst[tcp:100.64.0.3:10051] flags[FIN,ACK]
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019] VERBOSE tunnel-sdk:tunnel_tcp.c:141 on_tcp_client_data() status 0 src[tcp:100.64.0.1:47923] dst[tcp:100.64.0.3:10051] state[10/TIME_WAIT] flags[0x123] service[zabbix_agent.svc]
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG tunnel-sdk:tunnel_tcp.c:145 on_tcp_client_data() client sent FIN: client=tcp:100.64.0.1:47923, service=zabbix_agent.svc
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG tunnel-sdk:tunnel_tcp.c:146 on_tcp_client_data() FIN received src[tcp:100.64.0.1:47923] dst[tcp:100.64.0.3:10051] state[10/TIME_WAIT] flags[0x123] service[zabbix_agent.svc]
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG tunnel-cbs:ziti_tunnel_cbs.c:157 ziti_sdk_c_close_write() closing ziti_conn tnlr_eof=0, ziti_eof=1
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG tunnel-cbs:ziti_tunnel_cbs.c:160 ziti_sdk_c_close_write() closing ziti_conn tnlr_eof=1, ziti_eof=1
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE ziti-sdk:connect.c:764 flush_connection() conn[0.12215/jymz5xFN/Connected] starting flusher
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE tunnel-sdk:netif_shim.c:34 netif_shim_output() writing packet TCP[100.64.0.3:10051 -> 100.64.0.1:47923] len=40
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE tunnel-sdk:netif_shim.c:62 netif_shim_input() done after reading 1 packets
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019] VERBOSE ziti-sdk:connect.c:811 flush_to_client() conn[0.12215/jymz5xFN/Connected] 0 bytes available
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE ziti-sdk:channel.c:384 ziti_channel_send_message() ch[2] => ct[ED71] seq[168] len[0]
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE ziti-sdk:channel.c:356 on_channel_send() ch[2] write delay = 0.000d q=1 qs=44
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE ziti-sdk:connect.c:228 on_write_completed() conn[0.12215/jymz5xFN/Connected] status 0
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019] VERBOSE ziti-sdk:connect.c:105 conn_set_state() conn[0.12215/jymz5xFN/Connected] transitioning Connected => Closed
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG ziti-sdk:channel.c:253 ziti_channel_rem_receiver() ch[2] removed receiver[12215]
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE ziti-sdk:connect.c:798 flush_to_service() conn[0.12215/jymz5xFN/Closed] flushed 1 messages
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE ziti-sdk:connect.c:757 on_flush() conn[0.12215/jymz5xFN/Closed] stopping flusher
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG ziti-sdk:connect.c:172 close_conn_internal() conn[0.12215/jymz5xFN/Closed] removing
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE tunnel-cbs:ziti_tunnel_cbs.c:594 ziti_conn_close_cb() ziti_conn[0x5651115226b0] is closed
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG tunnel-sdk:ziti_tunnel.c:434 ziti_tunneler_close() closing connection: client[tcp:100.64.0.1:47923] service[zabbix_agent.svc]
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG tunnel-sdk:tunnel_tcp.c:251 tunneler_tcp_close() closing src[tcp:100.64.0.1:47923] dst[tcp:100.64.0.3:10051] state[10/TIME_WAIT] flags[0x120] service[zabbix_agent.svc]
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG tunnel-sdk:tunnel_tcp.c:267 tunneler_tcp_close() closed src[] dst[] state[10/TIME_WAIT] flags[0x130] service[]
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019] VERBOSE tunnel-cbs:ziti_tunnel_cbs.c:606 ziti_conn_close_cb() nulled data for ziti_conn[0x5651115226b0]
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   TRACE ziti-sdk:connect.c:213 close_conn_internal() conn[0.12215/jymz5xFN/Closed] is being free()'d
May 31 03:04:44 hostname ziti-edge-tunnel[1243372]: (1243372)[    38294.019]   DEBUG ziti-sdk:ziti.c:1749 grim_reaper() ztx[0] reaped 1 closed (out of 2 total) connections
1 Like

Hi,

Are there any logs from after the point at which the tunneler goes unresponsive? I'm really hoping to see those to ascertain what might be preventing the retry logic from working.

Thanks!

1 Like

Hi, that’s all there is!
I just did a journalctl -u ziti-edge-tunnel and removed the lines from before the last start.
At the end, that’s it - no more logs from there

Ahh, ok. I think there may be a few different situations that result in ziti-edge-tunnel going lame.

For example in the initial post to this thread, and in subsequent messages that you've shared with me directly, the problem was that ziti-edge-tunnel had lost connection to the controller and was not reconnecting even though it should have been able to.

The logs that you sent today show ziti-edge-tunnel going completely silent, and I'd guess that if you looked at the process activity (e.g. in top) you'd see that it was burning an entire cpu. This issue I've seen sporadically when intercepted tcp connections are closed and somehow the list that is used to track tcp connections becomes circular, resulting in the next call to the tcp stack to into a tight loop.

Have you noticed which of these scenarios is happening more frequently for you?

Thanks

1 Like

It's very hard to tell which problem occurs more often because it has the same symptoms for us - ziti-edge-tunnel not being available anymore. I'd think that the latter scenario with ziti-edge-tunnel going completely silent happens more often.
And I think you're right with the CPU observation because that's also what we observed this night. Interestingly enough, this night we had two ziti-edge-tunnels which both quit pretty much at the same time. Both tunnelers have nothing in common except they both bind using the same wildcard service.

Do you have any idea in mind on how to solve the problem or what might be the root cause of ziti-edge-tunnel ending up in this situation?

Thanks!

This situation looks like a bug that I think is somehow related to intercepted TCP connections being closed or abandoned either before or after reaching the ESTABLISHED state. I'm looking into it now.

1 Like

Thank you. If you have an (unofficial) release to test or if there's any other way I can support, please let me know.

1 Like