TCP conn limit is 512 -- ZET not allowing sufficient conns

Hi @TheLumberjack, thanks for the feedback.

Hmm... The scenario is: network A has a ziti-edge-tunnel gateway reaching machines on network B. Network B has no single ziti gateway; instead, those machines each have ziti-edge-tunnel installed for "host mode" usage.

Machines on Network A <---> Network A ZT Gateway <---> Host A machine with Ziti
                                                 <---> Host B machine with Ziti
                                                 <---> Host C machine with Ziti

Considerations:

  1. Machines on networks A and B run a mesh protocol in which they probe each other for liveness/availability, mostly over UDP but also over TCP as a fallback.

  2. Host machines in network B are servers that are routinely under heavy load.

Trigger:

We scaled network A up by a few dozen machines, which seems to be what pushed us over this TCP limit today.

Impact:

We noticed the issue immediately after scaling, when a monitoring server in network A started reporting failed scrapes against its target machines in network B because its connections through the ziti overlay were being denied.

Thoughts:

  • At face value, it would appear that the additional mesh probes between networks A and B from the larger network A cluster were responsible for hitting the limit. However, looking at the ziti-edge-tunnel logs on machines in network B, only one or two mesh probes come in per second, so that doesn't seem to be the sole cause even if it was the trigger. TCP activity on the network A machines and on the gateway itself doesn't reach 512 connections either.

  • Network B machines are busy servers. While the TCP connection load that ziti traffic from network A puts on these machines is small, they do carry constant, significant TCP load from external sources, for example:

# netstat -an | awk '/tcp/ {print $6}' | sort | uniq -c
    226 CLOSE_WAIT
     41 ESTABLISHED
     17 FIN_WAIT2
     47 LISTEN
   3503 TIME_WAIT

  • However, if the TIME_WAIT count alone counted toward the limit, these network B servers in ziti host mode should have been causing these errors on the ziti gateway for a long time, since that count is far above 512. Looking further at the networking metrics on the network B machines, though, there is a perfect time correlation between the errors starting on the ziti gateway and the in-use IPv4 socket count crossing 512, which is where it now sits:
# cat /proc/net/sockstat
sockets: used 684
TCP: inuse 294 orphan 0 tw 2414 alloc 308 mem 229
UDP: inuse 96 mem 6
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
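
For a quick spot check outside our normal metrics pipeline, something like this tracks the same "sockets: used" counter against the 512 threshold (just a sketch of what I was eyeballing):

# watch -n 5 "awk '/^sockets:/ {print \$3}' /proc/net/sockstat"   # in-use socket count every 5s
# ss -s                                                           # per-protocol socket summary for cross-checking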

  • Looking at the metrics history for the network B machines, they had been sitting just barely below 512 in-use IPv4 sockets for weeks, and increasing the size of network A today pushed them over that threshold.

  • So if I'm interpreting this correctly, a really busy server won't be able to serve ziti traffic at all, because the remote end of the ziti tunnel hits TCP limit errors right away due to traffic that has nothing to do with ziti. It also seems odd that the machine logging the TCP limit errors is apparently not the machine actually hitting the limit.

  • Initially I was thinking I could lower the OS TIME_WAIT interval to bring that count down, but since the threshold appears to be tied to the in-use socket count rather than to TIME_WAIT, I'm not sure that would help.
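
For reference, these are the knobs I had been looking at before concluding they probably don't address the socket-count threshold (spot checks only; tcp_tw_reuse only applies to new outbound connections, and the TIME_WAIT interval itself is a compile-time kernel constant rather than a sysctl, as far as I can tell):

# sysctl net.ipv4.tcp_max_tw_buckets   # system-wide cap on sockets held in TIME_WAIT
# sysctl net.ipv4.tcp_tw_reuse         # allow reusing TIME_WAIT sockets for new outbound connections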

  • Regarding the option of a ziti-router at the gateway instead: does the ziti router have a higher (or no) default TCP limit? We also rely on the gateway for intercept functionality, and the last time I looked at the tproxy router mode it presented some problems for us in the way it wanted to dictate how we use DNS on the machine. Let me see if I can find that thread... Yup, here -- not sure whether these earlier concerns are still valid in the latest version of the router:

The tproxy option for a ziti-router server tests for primary resolver access to DNS and fails if it doesn't have it. Why does it need this? The client tunneler documentation here describes ziti-tunnel as having a similar requirement to be the primary resolver, but also mentions that:

“… The nameserver will only answer queries for which it is authoritative i.e. OpenZiti Services’ domain names, and so you will also need a secondary, recursive resolver.”

In the case of ziti-tunnel, which I haven't tried, I assume the secondary recursive resolver can be set with the --resolver flag. However, I don't see a similar option in the ziti-router command line or config options when running in tproxy mode. Did I miss seeing where to set that for ziti-router?

  • So I'm not sure we can readily switch to the router in tproxy mode. A custom build with a higher limit is probably the easiest path, but between nix/NixOS peculiarities and never having built the ziti product before (we've just been leveraging the release binaries so far), I'm not sure whether we'll hit some rabbit holes there; a rough sketch of what I think that build looks like is below. Worst case, I suppose we can revert to WireGuard for a short time if we need to.
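
In case it helps anyone following along, this is roughly the build I have in mind, going off the ziti-tunnel-sdk-c repo; treat it as an unverified sketch, since I haven't built it yet, dependency bootstrapping is omitted, and I haven't confirmed where the 512 limit actually lives (my guess is a compile-time constant in the tunneler SDK or its embedded lwIP stack, e.g. something like MEMP_NUM_TCP_PCB, but that's unconfirmed):

# git clone https://github.com/openziti/ziti-tunnel-sdk-c.git
# cd ziti-tunnel-sdk-c
# grep -rn 512 . | grep -i tcp          # hunt for the connection limit before patching it
# cmake -S . -B build                   # configure (deps/toolchain setup not shown)
# cmake --build build                   # build, including the ziti-edge-tunnel binary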