Hi there,
I’ve been in contact with Clint regarding a case we had with a customer of ours.
The customer runs a Ziti Edge Tunnel and sends log messages (approx. 100 per second) to UDP port 1514; the Ziti overlay carries them to a different machine, where the hosting side binds UDP port 1514.
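To make the traffic pattern concrete, here is roughly what the sending side does (a simplified sketch with a placeholder destination address, not our actual client):

/* Simplified sketch of the sending side: ~100 syslog-style messages per
 * second to UDP 1514. The destination address is a placeholder; in the
 * real setup the tunneler intercepts the traffic and carries it over the
 * Ziti overlay to the machine that hosts port 1514. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(1514);                        /* intercepted UDP port */
    inet_pton(AF_INET, "100.64.0.1", &dst.sin_addr);   /* placeholder service address */

    char msg[256];
    for (unsigned long seq = 0;; seq++) {              /* runs until interrupted */
        int n = snprintf(msg, sizeof msg, "<14>demo[%lu]: test log line\n", seq);
        sendto(fd, msg, (size_t)n, 0, (struct sockaddr *)&dst, sizeof dst);
        struct timespec pause = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
        nanosleep(&pause, NULL);                       /* 10 ms -> ~100 msg/s */
    }
}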
We noticed that on July 2nd at 15:01 we suddenly stopped receiving any logs.
I’m not sure if I can attach a log file here, but here are the relevant lines before and after the incident:
Jul 02 12:55:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 330930.039] ERROR ziti-sdk:ziti_ctrl.c:155 ctrl_resp_cb() ctrl[zt.mydomain.de] request failed: -110(connection timed out)
Jul 02 12:55:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 330930.039] ERROR ziti-sdk:ziti_ctrl.c:155 ctrl_resp_cb() ctrl[zt.mydomain.de] request failed: -110(connection timed out)
Jul 02 12:55:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 330930.039] ERROR ziti-sdk:ziti.c:1051 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/connection timed out] from ctrl[https://zt.mydomain.de:8441]
Jul 02 12:55:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 330930.039] WARN tunnel-cbs:ziti_tunnel_ctrl.c:739 on_ziti_event() ziti_ctx controller connections failed: Ziti Controller is not available
Jul 02 12:55:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 330930.039] ERROR ziti-edge-tunnel:ziti-edge-tunnel.c:1190 on_event() ztx[/opt/openziti/etc/identities/customer.json] failed to connect to controller due to Ziti Controller is not available
Jul 02 14:54:05 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338030.422] WARN ziti-sdk:channel.c:328 on_channel_send() ch[0] write delay = 1.758 q=126 qs=71657
Jul 02 15:01:35 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338479.782] ERROR tunnel-cbs:ziti_tunnel_cbs.c:375 on_ziti_write() ziti_write(ziti_conn[0x5631317871f0]) failed: Operation did not complete in time
Jul 02 15:01:36 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338481.203] ERROR tunnel-cbs:ziti_tunnel_cbs.c:375 on_ziti_write() ziti_write(ziti_conn[0x563131a10b20]) failed: Operation did not complete in time
Jul 02 15:01:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338489.978] ERROR ziti-sdk:channel.c:582 latency_timeout() ch[0] no read/write traffic on channel since before latency probe was sent, closing channel
Jul 02 15:01:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338489.978] ERROR ziti-sdk:connect.c:928 connect_reply_cb() conn[0.1116/Connecting] failed to connect [-19/no such device]
Jul 02 15:01:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338489.978] ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
Jul 02 15:01:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338489.978] ERROR ziti-sdk:connect.c:928 connect_reply_cb() conn[0.1115/Connecting] failed to connect [-19/no such device]
Jul 02 15:01:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338489.978] ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
Jul 02 15:01:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338489.978] WARN ziti-sdk:connect.c:1423 queue_edge_message() conn[0.1116/Closed] disconnecting from state[9]
Jul 02 15:01:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338489.978] WARN ziti-sdk:connect.c:1423 queue_edge_message() conn[0.1115/Closed] disconnecting from state[9]
Jul 02 15:01:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338489.978] WARN ziti-sdk:channel.c:328 on_channel_send() ch[0] write delay = 20.196 q=253 qs=147282
Jul 02 15:01:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338489.978] ERROR ziti-sdk:channel.c:349 on_channel_send() ch[0] write failed [-125/operation canceled]
... hundreds of operation cancelled logs ...
Jul 02 15:01:45 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338489.978] ERROR ziti-sdk:channel.c:349 on_channel_send() ch[0] write failed [-125/operation canceled]
Jul 02 15:01:51 HOSTNAME ziti-edge-tunnel[32534]: (32534)[ 338496.382] ERROR tunnel-sdk:netif_shim.c:76 on_packet() pbuf_alloc failed
... thousands of pbuf_alloc failed logs ...
Any ideas what happened here?
Thanks!
Dominik
Thanks @dmuensterer. I forwarded the logs over to the relevant people. It’s better/easier for everyone involved to track this here, and it has the added benefit of being searchable by someone else in the future (if the information is relevant at that time).
We’ll be in touch.
Some additional information from the controller logs:
Jul 2 15:00:36 zt ziti-router[1201863]: {"_context":"{c/Po.uaYnh-|@/J5l2n}\u003cTerminator\u003e","error":"payload buffer closed","file":"github.com/openziti/fabric@v0.21.9/router/xgress/xgress.go:564","func":"github.com/openziti/fabric/router/xgress.(*Xgress).forwardPayload","level":"error","msg":"failure to buffer payload","time":"2023-07-02T15:00:36.861Z"}
Jul 2 15:01:35 zt ziti-controller[1653554]: {"level":"info","msg":"http: TLS handshake error from 142.132.XXX.XXX:38590: read tcp XXX.XXX.XXX.XXX:8441-\u003e142.132.XXX.XXX:38590: i/o timeout","time":"2023-07-02T15:01:35.727Z"}
Hi, any ideas? Here’s my theory of what happened; do you think it might be accurate?
- Somehow the controller at zt.mydomain.de became unavailable.
- The client sends hundreds of messages per second. Since this is UDP there is no stream; every message goes out as its own datagram, resulting in at least one error per UDP message.
- Once the limit of 512 simultaneous connections was reached, the client kept sending anyway, because with UDP being fire-and-forget it couldn’t detect that the other side was unavailable (see the sketch below).
- At some point all available pbufs were allocated, leaving ziti-edge-tunnel unable to recover by itself.
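To illustrate the fire-and-forget point, here’s a standalone sketch (nothing Ziti-specific): nothing is listening on this port, yet every sendto() reports success, so the sender never notices that the destination is gone.

/* Standalone demonstration of UDP "fire & forget": nothing listens on
 * this port, yet every sendto() reports success because the local stack
 * accepted the datagram. The sender gets no signal that the far side
 * (here: nobody; in our case: the failing ziti dials) is unreachable. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(1514);
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);    /* no listener here */

    const char msg[] = "<14>demo: is anyone there?\n";
    for (int i = 0; i < 5; i++) {
        ssize_t n = sendto(fd, msg, sizeof msg - 1, 0,
                           (struct sockaddr *)&dst, sizeof dst);
        printf("sendto #%d returned %zd (no error reported)\n", i, n);
    }
    close(fd);
    return 0;
}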
Does this make sense?
If we had used TCP instead, would the outcome have been different?
Thanks.
Thank you for the bump on this. Yes, your hypothesis about what happened makes sense. As for TCP: if the client still tried to connect over and over again in a very short window of time, I’m not certain, but I do think the result would have been the same.
@scareything or @ekoby were either of you able to look at the logs? Any additional thoughts?
I’m still not sure what happened to the controller that led to the ziti writes and dials failing, but yes, once that happened ziti-edge-tunnel was unable to proxy the packets that it was receiving from the clients.
It should have recovered once controller connectivity was restored, but from the logs @TheLumberjack shared with me it seems that it didn’t.
Which version of ziti-edge-tunnel is this? I can see the ziti SDK version (0.31.2) in the logs, which probably means this is ziti-edge-tunnel 0.20.21 or 0.20.22. There were some pbuf leaks fixed in 0.20.23 that you might be seeing here.
Some background on UDP processing:
When ziti-edge-tunnel receives one or more packets for a ziti service from a new UDP client (as identified by source ip:port), the packets are queued up in “pbufs” (packet buffers) so they can be sent once the ziti connection is established (see the sketch after this list). The packets are retained until:
- the ziti dial succeeds, in which case they are written to the ziti connection
- the ziti dial fails, in which case they are dropped/free’d upon failure
- a 30-second timer that was started when the packet was received elapses
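Here’s a rough sketch of that per-flow lifecycle (an illustrative model in plain C, not ziti-edge-tunnel source; the queue bound and names are assumptions):

/* Illustrative model of the per-flow UDP queue described above (not
 * ziti-edge-tunnel source; the queue bound and names are assumptions). */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define QUEUE_MAX 64        /* per-flow cap, purely illustrative */
#define PKT_TTL_S 30        /* the 30-second timer from the list above */

typedef struct {
    char   data[1500];
    size_t len;
    time_t received_at;
} queued_pkt;

typedef struct {
    queued_pkt pkts[QUEUE_MAX];
    int        count;
} udp_flow;

/* called for each datagram that arrives before the ziti dial completes */
static int flow_enqueue(udp_flow *f, const char *buf, size_t len) {
    if (f->count == QUEUE_MAX || len > sizeof f->pkts[0].data)
        return -1;                         /* no buffer available: drop */
    queued_pkt *p = &f->pkts[f->count++];
    memcpy(p->data, buf, len);
    p->len = len;
    p->received_at = time(NULL);
    return 0;
}

/* cases 1 and 2: dial finished, so flush on success or discard on failure */
static void flow_dial_complete(udp_flow *f, int dial_ok) {
    for (int i = 0; i < f->count; i++)
        if (dial_ok)
            printf("write %zu bytes to the ziti connection\n", f->pkts[i].len);
    f->count = 0;                          /* either way, the buffers are released */
}

/* case 3: a periodic sweep frees packets whose 30-second timer elapsed */
static void flow_expire(udp_flow *f) {
    time_t now = time(NULL);
    int kept = 0;
    for (int i = 0; i < f->count; i++)
        if (now - f->pkts[i].received_at < PKT_TTL_S)
            f->pkts[kept++] = f->pkts[i];
    f->count = kept;
}

int main(void) {
    udp_flow f = { .count = 0 };
    flow_enqueue(&f, "<14>demo: hello\n", 16);
    flow_expire(&f);                       /* nothing is old enough to expire */
    flow_dial_complete(&f, 1);             /* pretend the dial succeeded */
    return 0;
}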
Some elaboration on the second case: ziti dials may fail by timing out (after 10 seconds). If the dial fails for this reason then ziti-edge-tunnel continues to queue up packets for the timeout duration. Furthermore, due to the way ziti-edge-tunnel sequences connection closure, the ziti connection must be closed before the UDP connection is closed (and the associated queued packets are freed). The closure of the ziti connection requires communication with the edge router, which may be degraded. If the ziti close is delayed then the 30-second packet timer will ultimately be what frees the packet(s).
So if we’re in this state and your clients are sending more than one packet every 10 (or 30) seconds then it’s only a matter of time before the packet buffer pool is exhausted. The pool is (should be) churning the entire time, but it’s always full. It should recover when the ziti controller/router connectivity is restored and packets can once again be offloaded.
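A quick back-of-envelope using the ~100 messages per second reported in the original post (the actual pbuf pool size is build-dependent, so treat the conclusion as approximate):

/* Assumes the ~100 msg/s rate from the original post and the 30-second
 * retention described above; the actual pbuf pool size is build-dependent. */
#include <stdio.h>

int main(void) {
    const int msgs_per_sec   = 100;  /* reported client rate */
    const int retention_secs = 30;   /* per-packet timer */
    /* Up to ~3000 pbufs can be outstanding while dials keep failing,
     * which will exhaust any fixed pool smaller than that. */
    printf("worst-case pbufs held: %d\n", msgs_per_sec * retention_secs);
    return 0;
}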
By the way, TCP is slightly different in this regard since ziti-edge-tunnel doesn’t queue packets for TCP clients. Instead it withholds sending ACKs to clients until the ziti dial is successful (or, if the connection is established, until the previous packet was successfully written to ziti), so the TCP clients don’t send data unless the TCP window permits. This isn’t to say that you can’t fill the packet buffer pool with enough concurrent and/or active TCP clients, but the inherent flow control definitely helps.
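To make that contrast concrete, here’s a self-contained sketch using plain sockets (not tunneler code): the receiver stops consuming, the sender’s buffers fill, and writes stall instead of piling up the way UDP datagrams do.

/* Plain-sockets contrast with UDP (error handling trimmed for brevity):
 * the listener never reads, so the sender hits backpressure (EAGAIN)
 * once the TCP send/receive buffers are full. UDP has no such brake. */
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* listener on an ephemeral loopback port; it accepts but never reads */
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 1);
    socklen_t alen = sizeof addr;
    getsockname(srv, (struct sockaddr *)&addr, &alen);  /* learn the port */

    /* client connects, then switches to non-blocking writes */
    int cli = socket(AF_INET, SOCK_STREAM, 0);
    connect(cli, (struct sockaddr *)&addr, sizeof addr);
    int conn = accept(srv, NULL, NULL);
    fcntl(cli, F_SETFL, O_NONBLOCK);

    char chunk[1024];
    memset(chunk, 'x', sizeof chunk);
    long total = 0;
    for (;;) {
        ssize_t n = send(cli, chunk, sizeof chunk, 0);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            printf("backpressure after %ld bytes: TCP window/buffers are full\n", total);
            break;
        }
        if (n < 0) { perror("send"); break; }
        total += n;
    }
    close(conn); close(cli); close(srv);
    return 0;
}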