Hi @scareything, thank you!
No, I hadn’t really thought through the fact that interception involves two TCP connections, although that makes sense.
Yes, the thing throwing the TCP conn limit errors is the ziti-edge-tunnel service on the network A gateway.
So based on your comments, it makes sense then that:
- The gateway machine whose ZET service is logging the TCP limit is really the machine with the problem.
- Not readily finding OS metrics showing usage above 512 TCP conns on the gateway machine doesn’t mean the limit wasn’t hit; some of those conns simply aren’t visible in OS metrics (see the snippet just after this list).
- The network B host machine’s metrics climbing above 512 sockets in use at the same moment the gateway machine’s ZET alerted on the TCP limit was just coincidence; the TCP limit is not tied to the network B machine.
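To make that second point concrete, here is roughly what I had been checking (a sketch only; the per-process count below only reflects sockets the kernel knows about, so any connections ZET handles internally rather than through kernel sockets wouldn’t appear in either number):

```
# Kernel-wide socket counters (what I had been watching)
cat /proc/net/sockstat

# TCP sockets owned by the ziti-edge-tunnel process itself
# (assumes `ss` is installed and ZET runs as a single process)
ss -tnp | grep -c ziti-edge-tunnel
```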
Feature suggestion: add the current number of consumed TCP conns, and any other similar parameters with hard limits, to the tunnel_status CLI output, since a user can’t obtain these from OS metrics.
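Something along these lines is what I’m imagining (purely illustrative; the fields shown here are made up and do not exist in the current tunnel_status output):

```
# Hypothetical output only -- these counters are the feature suggestion, not current behavior
$ ziti-edge-tunnel tunnel_status
{
  ...
  "tcp_conns_in_use": 497,
  "tcp_conn_limit": 512,
  "udp_conns_in_use": 113,
  ...
}
```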
Thank you very much for the branch… I was able to pull a CI artifact and try it. After a few minutes, it is also reporting that the 1024 TCP limit has been hit. Current socket usage and TCP conn info (which may not be helpful, based on the comments above) is:
```
# cat /proc/net/sockstat
sockets: used 382
TCP: inuse 39 orphan 0 tw 26 alloc 59 mem 9
UDP: inuse 146 mem 8
UDPLITE: inuse 0
RAW: inuse 1
FRAG: inuse 0 memory 0
# netstat -tn | awk '/tcp/ {print $6}' | sort | uniq -c
     34 ESTABLISHED
     18 SYN_RECV
     25 TIME_WAIT
```
So maybe we have a misconfig or a leak for some reason, although we haven’t touched the setup or config in a few months.
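To check whether this is a slow leak rather than a burst, I’m going to sample the same counters periodically, roughly like this (simple sketch; the 60-second interval and log path are arbitrary):

```
# Record socket and TCP-state counts once a minute to watch for a slow climb
while true; do
  echo "--- $(date)"
  grep -E '^(sockets|TCP)' /proc/net/sockstat
  netstat -tn | awk '/tcp/ {print $6}' | sort | uniq -c
  sleep 60
done >> /var/log/socket-usage.log
```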
In looking further at the ziti-router logs (the router is also co-located on the same server as the ZET service that is reporting the TCP limit), I’m seeing a lot of the following error, each occurrence with a different circuitId:
{"_context":"{c/ZyajblpHn|@/Qwbk}\u003cTerminator\u003e","circuitId":"ZyajblpHn","error":"cannot forward payload, no destination for circuit=ZyajblpHn src=Qwbk dst=Nq05","file":"github.com/openziti/fabric@v0.22.24/router/handler_xgress/receive.go:35","func":"github.com/openziti/fabric/router/handler_xgress.(*receiveHandler).HandleXgressReceive","level":"error","msg":"unable to forward payload","origin":1,"seq":1,"time":"2023-05-17T22:46:51.144Z","uuid":"invalid-uuid-size-of-0-bytes"}
Then, every few to several minutes, there is a large block of log lines like the following, each with a different circuitId (the idleThreshold and idleTime values are in nanoseconds, so the threshold is 60 s and this circuit had been idle for roughly 140 s):
{"circuitId":"SW95VBnFi","ctrlId":"$FQDN","file":"github.com/openziti/fabric@v0.22.24/router/forwarder/scanner.go:85","func":"github.com/openziti/fabric/router/forwarder.(*Scanner).scan","idleThreshold":60000000000,"idleTime":140640000000,"level":"warning","msg":"circuit exceeds idle threshold","time":"2023-05-17T22:18:45.459Z"}
The ziti router and controller in this setup are running 0.27.5.
I’m not sure how to debug the above router errors, so any suggestions are welcome. If I restart the ZET service so that it isn’t hitting the connection limit, then for at least the first few minutes all the connections that should be transiting between networks A and B appear to work as far as I can tell, yet the router log is still emitting the above messages. Restarting the ZET, router, and controller services doesn’t change the behavior.
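One thing I can do from the controller side, if it would help, is dump the current circuits and check whether the IDs from those errors still exist; I’m assuming `ziti fabric list circuits` is the right command for that on 0.27.5:

```
# List current circuits and look for the IDs named in the router errors
# (assumes the ziti CLI is already logged in to the controller)
ziti fabric list circuits | grep -E 'ZyajblpHn|SW95VBnFi'
```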