In our deployment, the Ziti edge tunneler appears to lose connectivity and break communication between endpoints when bandwidth utilization is high, either on the remote ZET machine or through the tunnel. Can anyone suggest tuning that might address this?
In this PoC we have a Ziti controller, a public router, and a ziti-edge-tunnel (ZET) on machine A. We opted to use ZET instead of just the router in tproxy mode because of the DNS requirements of tproxy mode: the router must be the primary DNS provider, and there doesn't seem to be an option for an upstream resolver or split DNS like there is with ZET, which makes tproxy mode incompatible with our use case. I did see an option for fulfilling the tproxy router mode DNS requirement with a file-based DNS option, which might work for us, but I haven't tested it and don't know what the downsides might be, so we stuck with a ZET in place of the tproxy router mode for now (a rough, untested sketch of what I think that option would look like is below).
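For completeness, here is roughly what I think that untested file-based option would look like in the router config's tunnel listener. This is only a sketch based on my reading of the router configuration reference; the option names (mode, resolver, dnsSvcIpRange), the file:// scheme, and the file path are my assumptions and I haven't tried any of it:
# Hypothetical tproxy tunnel listener in the router config (untested sketch)
listeners:
  - binding: tunnel
    options:
      mode: tproxy
      # instead of the router acting as the primary DNS server, intercepted
      # hostnames would be written to a hosts-format file that an external
      # resolver (e.g. dnsmasq) could consume
      resolver: file:///opt/openziti/ziti-router.hosts
      dnsSvcIpRange: 100.64.0.0/10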
Machine A, with the controller, public router, and ZET, acts as a gateway (ZTNA) and receives LAN traffic to send to machine B on the other side of the ZET service tunnel.
Machine B has a ZET installed directly on it in host access mode (ZTHA) and receives the tunnel data from the ZET on machine A.
What we observe from client software in the LAN using the ZTNA on machine A is that the remote ZET (machine B) randomly disconnects in the middle of high bandwidth usage. This breaks, for example, long-running CI jobs: a CI client gets halfway through a batch transfer of files required for some CI action, and the result is a pipeline build failure. Example output as seen from the client (note that in this case the file transfers in the logs are not going through the Ziti tunnel, but are being pulled locally on the remote machine at 192.168.3.1):
2022-11-22T08:12:46.173993659+00:00 stderr F github-required> copying path '/nix/store/k4s57mhixpyqr8vqhbx4gaa2f93gd5dc-flat-lib-flat-0.5.2' from 'http://192.168.3.1:8081'...
2022-11-22T08:12:46.540767006+00:00 stderr F github-required> Connection to 10.10.0.1 closed by remote host.
2022-11-22T08:12:46.545284558+00:00 stderr F error: unexpected end-of-file
2022-11-22T08:12:46.725614902+00:00 stderr F error: builder for '/nix/store/v3h24ckjln19z65wikafldg7wfw684qj-github-required.drv' failed with exit code 1
In the machine B ZET logs, which is where the client output above indicates the connection closure originated, we see the following at the matching time:
Nov 22 08:12:02 MACHINE_B ziti-edge-tunnel[4014516]: (4014516)[ 1213091.773] WARN ziti-sdk:channel.c:815 channel_alloc_cb() can't alloc message
Nov 22 08:12:02 MACHINE_B ziti-edge-tunnel[4014516]: (4014516)[ 1213091.784] WARN ziti-sdk:channel.c:815 channel_alloc_cb() can't alloc message
Nov 22 08:12:02 MACHINE_B ziti-edge-tunnel[4014516]: (4014516)[ 1213091.796] WARN ziti-sdk:channel.c:815 channel_alloc_cb() can't alloc message
Nov 22 08:12:02 MACHINE_B ziti-edge-tunnel[4014516]: (4014516)[ 1213091.807] WARN ziti-sdk:channel.c:815 channel_alloc_cb() can't alloc message
Nov 22 08:12:03 MACHINE_B ziti-edge-tunnel[4014516]: (4014516)[ 1213092.870] INFO ziti-sdk:ziti.c:1415 ziti_set_api_session() ztx[1] api session set, setting api_session_timer to 1740s
Nov 22 08:12:10 MACHINE_B ziti-edge-tunnel[4014516]: (4014516)[ 1213099.211] INFO ziti-sdk:ziti.c:1415 ziti_set_api_session() ztx[0] api session set, setting api_session_timer to 1740s
Nov 22 08:12:46 MACHINE_B ziti-edge-tunnel[4014516]: (4014516)[ 1213135.418] INFO tunnel-cbs:ziti_tunnel_ctrl.c:804 on_ziti_event() ztx[MACHINE_B.ziti] router $CONTROLLER_FQDN@tls://$CONTROLLER_FQDN:3022 disconnected
Nov 22 08:12:46 MACHINE_B ziti-edge-tunnel[4014516]: (4014516)[ 1213135.418] INFO ziti-sdk:channel.c:739 reconnect_channel() ch[0] reconnecting in 8558 ms (attempt = 1)
Nov 22 08:12:47 MACHINE_B ziti-edge-tunnel[4014516]: (4014516)[ 1213135.985] INFO ziti-sdk:channel.c:742 reconnect_channel() ch[0] reconnecting NOW
Nov 22 08:12:47 MACHINE_B ziti-edge-tunnel[4014516]: (4014516)[ 1213136.530] ERROR ziti-sdk:channel.c:485 dispatch_message() ch[0] received unexpected message ct[ED72] in Connecting state
Nov 22 08:12:47 MACHINE_B ziti-edge-tunnel[4014516]: (4014516)[ 1213136.530] ERROR ziti-sdk:channel.c:496 dispatch_message() ch[0] received message without conn_id or for unknown connection ct[ED72] conn_id[4857]
Nov 22 08:12:47 MACHINE_B ziti-edge-tunnel[4014516]: (4014516)[ 1213136.530] ERROR ziti-sdk:channel.c:485 dispatch_message() ch[0] received unexpected message ct[ED72] in Connecting state
In machine B's ZET logs we first see the "can't alloc message" WARNs, which seem to be logged in significant quantity whenever network bandwidth is high, followed by a message indicating disconnection from the Ziti controller, after which an automatic reconnection is attempted and succeeds after a second or two. Unfortunately, the software using the tunnel during this event breaks (e.g. the CI jobs mentioned above).
On the machine A side, the ZET systemd service has no logs of significance at this timestamp. From the ziti-router and ziti-controller systemd services, the relevant logs at this timestamp are:
# Router logs on machine A:
Nov 22 08:12:46 zt ziti-router[271796]: {"file":"github.com/openziti/edge@v0.24.12/router/xgress_edge/fabric.go:74","func":"github.com/openziti/edge/router/xgress_edge.(*edgeTerminator).close","level":"info","msg":"removing terminator on controller","terminatorId":"4l5VOO3g3XPZ0tp8vOFQg3","time":"2022-11-22T08:12:46.537Z","token":"$TOKEN1"}
Nov 22 08:12:46 zt ziti-router[271796]: {"_context":"ch{edge}-\u003eu{classic}-\u003ei{bkDW}","file":"github.com/openziti/edge@v0.24.12/router/xgress_edge/listener.go:114","func":"github.com/openziti/edge/router/xgress_edge.(*edgeClientConn).HandleClose","level":"warning","msg":"failed to remove terminator for session on channel close","terminatorId":"4l5VOO3g3XPZ0tp8vOFQg3","time":"2022-11-22T08:12:46.540Z","token":"$TOKEN1"}
Nov 22 08:12:47 zt ziti-router[271796]: {"_context":"ch{edge}-\u003eu{classic}-\u003ei{koeO}","chSeq":1,"connId":0,"edgeSeq":0,"file":"github.com/openziti/edge@v0.24.12/router/xgress_edge/listener.go:302","func":"github.com/openziti/edge/router/xgress_edge.(*edgeClientConn).processBind","level":"info","msg":"created terminator","routerId":"HhEiZMAet","sessionId":"b8ab1333-b61b-4cc3-aec8-cfacb52d575b","terminatorId":"","time":"2022-11-22T08:12:47.663Z","type":"EdgeBindType","uuid":"invalid-uuid-size-of-0-bytes"}
# Controller logs on machine A:
Nov 22 08:12:46 zt ziti-controller[271702]: {"file":"github.com/openziti/edge@v0.24.12/controller/handler_edge_ctrl/remove_terminator.go:118","func":"github.com/openziti/edge/controller/handler_edge_ctrl.(*removeTerminatorHandler).RemoveTerminator","level":"info","msg":"removed terminator","routerId":"HhEiZMAet","serviceId":"7LBEyG0NFDL2hY1dAWZ701","terminator":"4l5VOO3g3XPZ0tp8vOFQg3","time":"2022-11-22T08:12:46.539Z","token":"$TOKEN2"}
Nov 22 08:12:46 zt ziti-controller[271702]: {"error":"invalid terminator: not found","file":"github.com/openziti/edge@v0.24.12/controller/handler_edge_ctrl/common.go:318","func":"github.com/openziti/edge/controller/handler_edge_ctrl.(*baseSessionRequestContext).verifyTerminator","level":"error","msg":"terminator not found","operation":"remove.terminator","terminatorId":"4l5VOO3g3XPZ0tp8vOFQg3","time":"2022-11-22T08:12:46.540Z"}
Nov 22 08:12:46 zt ziti-controller[271702]: {"_context":"ch{HhEiZMAet}-\u003eu{classic}-\u003ei{erpQ}","error":"invalid terminator: not found","file":"github.com/openziti/edge@v0.24.12/controller/handler_edge_ctrl/common.go:75","func":"github.com/openziti/edge/controller/handler_edge_ctrl.(*baseRequestHandler).returnError","level":"error","msg":"responded with error","operation":"remove.terminator","routerId":"HhEiZMAet","time":"2022-11-22T08:12:46.540Z","token":"$TOKEN2"}
Nov 22 08:12:47 zt ziti-controller[271702]: {"_context":"ch{HhEiZMAet}-\u003eu{classic}-\u003ei{erpQ}","file":"github.com/openziti/edge@v0.24.12/controller/handler_edge_ctrl/create_terminator.go:122","func":"github.com/openziti/edge/controller/handler_edge_ctrl.(*createTerminatorHandler).CreateTerminator","level":"info","msg":"created terminator","routerId":"HhEiZMAet","service":"MACHINE_B","serviceId":"7LBEyG0NFDL2hY1dAWZ701","terminator":"4v7NZmXgXeU3BsI8kuZl0x","time":"2022-11-22T08:12:47.662Z","token":"b8ab1333-b61b-4cc3-aec8-cfacb52d575b"}
The service configs are very simple for both the host and intercept sides:
# Intercept assigned to machine A (ZTNA)
{
  "addresses": [
    "10.10.0.1/32"
  ],
  "portRanges": [
    {
      "high": 22,
      "low": 22
    }
  ],
  "protocols": [
    "tcp"
  ]
}
# Host assigned to machine B (ZTHA)
{
  "address": "192.168.3.2",
  "port": 22,
  "protocol": "tcp"
}
I think the first thing I'll try is to significantly increase the CONNECT TIMEOUT SECONDS option in the dialOptions of the intercept config and see if that helps. I don't see a similar timeout option in the host configuration, at least in the web UI, and that seems like where we would actually need it, since it appears to be machine B with the host config that is disconnecting from the controller.
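For reference, here is a sketch of the intercept config change I have in mind. The raw JSON key behind that UI field should be connectTimeoutSeconds under dialOptions, if I'm reading the intercept.v1 config correctly, and the 60-second value is just an arbitrary guess on my part:
# Intercept assigned to machine A (ZTNA), with a longer dial timeout (untested sketch)
{
  "addresses": [
    "10.10.0.1/32"
  ],
  "portRanges": [
    {
      "high": 22,
      "low": 22
    }
  ],
  "protocols": [
    "tcp"
  ],
  "dialOptions": {
    "connectTimeoutSeconds": 60
  }
}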
Does anyone have other ideas on how to resolve this?
Versions in use at the moment are:
NAME                VERSION
ziti                v0.26.10
ziti-controller     v0.26.10
ziti-prox-c         not installed
ziti-router         v0.26.10
ziti-tunnel         v0.26.10
ziti-edge-tunnel    v0.20.6