Edge Tunnel crashes with state[0/CLOSED]

We had a weird crash today, which resulted in Ziti-Edge-Tunnel 0.22.22 not crashing on a systemd level but effectively crashing and we don't know what caused it. At 05:19, all services weren't working anymore.
Do you have any ideas how we could prevent that in the future? Any possibility this could be caused by time differences of the controller and Ziti Edge Tunnel?
We're getting this error sometimes, that the Ziti Controller is 30 seconds behind. What kind of consequences does this have?
Would it be possible to let the Edge Tunnel crash here so it can get restarted by systemd?

Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.627]   ERROR ziti-sdk:connect.c:963 connect_reply_cb() conn[0.25070/H8eIhzJ4/Connecting] failed to connect, reason=timeout waiting for message reply: context deadline exceeded
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.627]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.769]   ERROR tunnel-sdk:tunnel_tcp.c:188 on_tcp_client_err() client=tcp:100.64.0.1:49131 err=-14, terminating connection
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.769]   ERROR tunnel-sdk:tunnel_tcp.c:188 on_tcp_client_err() client=tcp:100.64.0.1:33837 err=-14, terminating connection
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.808]   ERROR tunnel-sdk:tunnel_tcp.c:188 on_tcp_client_err() client=tcp:100.64.0.1:37579 err=-14, terminating connection
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.808]   ERROR tunnel-sdk:tunnel_tcp.c:188 on_tcp_client_err() client=tcp:100.64.0.1:54253 err=-14, terminating connection
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.808]   ERROR tunnel-sdk:tunnel_tcp.c:188 on_tcp_client_err() client=tcp:100.64.0.1:55023 err=-14, terminating connection
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.809]   ERROR tunnel-sdk:tunnel_tcp.c:188 on_tcp_client_err() client=tcp:100.64.0.1:54025 err=-14, terminating connection
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.809]   ERROR tunnel-sdk:tunnel_tcp.c:188 on_tcp_client_err() client=tcp:100.64.0.1:54181 err=-14, terminating connection
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.849]    WARN ziti-sdk:channel.c:524 dispatch_message() ch[0] received message without conn_id or for unknown connection ct[ED72] conn_id[25071]
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.849]    WARN ziti-sdk:channel.c:524 dispatch_message() ch[0] received message without conn_id or for unknown connection ct[ED72] conn_id[25076]
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.850]   ERROR tunnel-sdk:tunnel_tcp.c:188 on_tcp_client_err() client=tcp:100.64.0.1:47997 err=-14, terminating connection
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.872]    WARN ziti-sdk:channel.c:524 dispatch_message() ch[0] received message without conn_id or for unknown connection ct[ED72] conn_id[25078]
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.872]    WARN ziti-sdk:channel.c:524 dispatch_message() ch[0] received message without conn_id or for unknown connection ct[ED72] conn_id[25074]
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.923]    WARN ziti-sdk:channel.c:524 dispatch_message() ch[0] received message without conn_id or for unknown connection ct[ED72] conn_id[25079]
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.923]    WARN ziti-sdk:channel.c:524 dispatch_message() ch[0] received message without conn_id or for unknown connection ct[ED72] conn_id[25075]
Feb 28 05:19:30 hostname ziti-edge-tunnel[1072214]: (1072214)[    67379.923]    WARN ziti-sdk:channel.c:524 dispatch_message() ch[0] received message without conn_id or for unknown connection ct[ED72] conn_id[25080]
Feb 28 05:19:38 hostname ziti-edge-tunnel[1072214]: (1072214)[    67386.529]   ERROR tunnel-sdk:tunnel_tcp.c:238 tunneler_tcp_close_write() tcp_shutdown failed: err=-11 src[tcp:100.64.0.1:36947] dst[tcp:100.64.0.4:10051] state[0/CLOSED] flags[0x100] service[zabbix_agent.svc]

It's hard to say exactly what happened here, but my best guess is that the event loop that ziti-edge-tunnel uses for i/o became stalled for some reason. I don't think it has anything to do with the time delta from the server, but I'll take a look at that code and try a few things. The error conditions that are being logged aren't normal but they also aren't necessarily something we'd want to stop the tunneler for either.

Could you send the full logs/journal from this run of ziti-edge-tunnel?

Thanks for PMing the logs to me. It looks like something catastrophic happened before the journal rolled. The large number of ziti context is not authenticated suggests connection with the controller was lost and ziti-edge-tunnel never got reconnected. As far as I can tell it didn't even try to reconnect.

I'm sending the info to a teammate who better understands the controller interactions. If you encounter this situation again it would be helpful to have increased log levels as well as increased journal capacity.

With or without increased logs we might be able to get some post-mortem details by running a diagnostic script that we've been working on recently. It expects the process to have been started by the the systemd unit file that we install with the ziti-edge-tunnel Linux packages.

1 Like

From what we can see in the logs that you DM'd me it looks like everything is fine until connecting to the controller fails with "temporary failure":

Mar 11 04:46:15 hostname ziti-edge-tunnel[4399]: (4399)[  2124013.075]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[controller.ziti] request failed: -3001(temporary failure)
Mar 11 04:46:15 hostname ziti-edge-tunnel[4399]: (4399)[  2124013.075]   ERROR ziti-sdk:ziti.c:1308 edge_routers_cb() ztx[0] failed to get current edge routers: CONTROLLER_UNAVAILABLE/temporary failure
Mar 11 04:46:15 hostname ziti-edge-tunnel[4399]: (4399)[  2124013.075]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[controller.ziti] request failed: -3001(temporary failure)
Mar 11 04:46:15 hostname ziti-edge-tunnel[4399]: (4399)[  2124013.075]   ERROR ziti-sdk:ziti.c:1100 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/temporary failure] from ctrl[https://controller.ziti:8441]
Mar 11 04:46:15 hostname ziti-edge-tunnel[4399]: (4399)[  2124013.075]    WARN tunnel-cbs:ziti_tunnel_ctrl.c:781 on_ziti_event() ziti_ctx controller connections failed: ziti controller is not available

Note that there was a failed connection to the controller prior to this in your logs, but the reason was "host unreachable" and it recovered on the first or second retry. But once the connection fails due to "temporary failure" it never recovers. The session that authorized ziti-edge-tunnel to communicate with the controller eventually expires:

Mar 11 05:08:58 hostname ziti-edge-tunnel[4399]: (4399)[  2125376.679]    WARN ziti-sdk:ziti.c:1598 api_session_cb() ztx[0] failed to get api session from ctrl[https://controller.ziti:8441] api_session_state[3] CONTROLLER_UNAVAILABLE[-16] temporary failure

And the session cannot be renewed because whatever was causing the "temporary failure" is still happening. Now because of the expired session we see a lot of UNAUTHORIZED error messages thrown into the mix, but the root of the problem is still CONTROLLER_UNAVAILABLE due to temporary failure

The connection error code "-3001" is being returned by getaddrinfo(), which ziti-edge-tunnel uses to resolve the controller's hostname to an IP address that will then be connected to. The symbol for this code is EAI_AGAIN, and the description from the documentation is vague:

The name server returned a temporary failure indication. Try again later.

Do you know of anything in your environment that could be a factor in DNS lookups failing? Is anything happening around 05:00-05:30? I'm assuming a restart of ziti-edge-tunnel fixes the issue temporarily (perhaps an overused word here).

It would be helpful to see a packet capture of DNS (port 53) packets on a host when this is happening. Also which DNS resolver are these hosts using... can you show me the resolver configuration when these "temporary" errors are happening? For example, if the resolver is systemd-resolved you can run resolvectl status.

Thanks,
-Shawn

1 Like

Hi Shawn,

thanks for the thorough reply which was really helpful.
Indeed we could see that some service was interfering with systemd-resolved and causes DNS issues mostly only for Ziti. We were able to fix this now.

Thanks
Dominik

1 Like