Terminator creation performance degradation

It seems like the problem is likely at the router layer then, since the interaction between the controller and router should be the same regardless of the source of the terminators. I'll try to shift my test setup in the direction of your setup, with more clients and a single terminator per service, and see if I can get it to fail.

For reference, below is the configuration I had been applying.

This gives me:

  • 1 service
  • 1 intercept.v1 config
  • 1 host.v1 config
  • 1 Dial service-policy
  • 1 Bind service-policy
  • 20 edge-router-policies

Each identity is assigned an identity role, ctrl-group-x, which I use to spread identities evenly across routers.

The end result is being able to address a client by its identity name, e.g. ssh user@<identity.name>.

# service-edge-router policy allowing all services to route over all routers. (#all is a special role).
ziti edge create service-edge-router-policy all --service-roles '#all' --edge-router-roles '#all'

# edge-router policy allows users with the role '#admin' to use any router.
ziti edge create edge-router-policy ADMIN --edge-router-roles '#all' --identity-roles '#admin'

# 20 edge-router-policies to provide even spreading of identities across routers. 
for ((i = 1; i <= 20; i++)); do
  ziti edge create edge-router-policy "GROUP-${i}" --edge-router-roles "#ctrl-group-${i}" --identity-roles "#ctrl-group-${i}"
done

ziti edge create config ssh.cfg.intercept intercept.v1 "{
    \"addresses\": [\"*.${ZITI_IDENTITY_DOMAIN}\"],
    \"protocols\": [\"tcp\"],
    \"portRanges\": [{\"low\":22,\"high\":22}],
    \"dialOptions\": {\"identity\": \"\$dst_hostname\"}
}"

ziti edge create config ssh.cfg.host host.v1 '{
    "address": "127.0.0.1",
    "protocol": "tcp",
    "port": 22,
    "listenOptions": {
      "identity": "$tunneler_id.name",
      "maxConnections": 1
    }
}'

ziti edge create service ssh \
  --configs ssh.cfg.intercept,ssh.cfg.host \
  --role-attributes admin,ctrl

ziti edge create service-policy ssh.dial Dial --identity-roles "#admin" --service-roles "@ssh"
ziti edge create service-policy ssh.bind Bind --identity-roles "#ctrl" --service-roles "@ssh"
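
The identity creation itself isn't shown above. A dry-run sketch of spreading identities round-robin across the 20 ctrl groups might look like the following; the identity names are placeholders, and create-identity flags can vary between ziti CLI versions:

```shell
# Print (dry run) the commands to create 2700 identities, assigning each to
# one of the 20 ctrl groups round-robin. The 'ctrl' role matches the Bind
# service-policy above; names and flags are illustrative.
for ((n = 1; n <= 2700; n++)); do
  g=$(( (n - 1) % 20 + 1 ))
  echo "ziti edge create identity client-${n} --role-attributes ctrl,ctrl-group-${g}"
done
```

Dropping the echo would execute the commands; with the configs above in place, each host would then answer as, e.g., ssh user@client-42.${ZITI_IDENTITY_DOMAIN} (name again a placeholder).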

I just tested with 900 clients/50 services each. I found a bug where maxConnections wasn't being honored, and I ended up with ~135k terminators. They were created in a reasonable amount of time, maybe 20-30 minutes. I'm going to try with more clients and the bug fixed.

With my 1-terminator-per-client use case, I start to experience a noticeable slowdown in terminator creation at about 6-7k.

I ran a test case with 2700 identities (my setup uses three regions, so I'm doing things in multiples of threes). Each identity has 50 services, 1 terminator for each, so should be reaching 135k terminators. I did see a slowdown around 11k terminators. I looked at the controller stackdump, and I think what's happening is a bottleneck on authenticating and getting the services list. At a certain point it sped up again and finished. All together it took about 35 minutes. I'm going to do one more test with a single service per identity and more identities and see if I can tell where things are getting hung up. My guess is that it's not the actual terminator creation, though.
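
For anyone reproducing this, one rough way to watch the terminator count during a run is to poll the edge API's list-response metadata. The sed extraction below is a crude, dependency-free sketch, and it assumes the standard pagination envelope; the exact list-command flags may vary by CLI version:

```shell
# Extract totalCount from a ziti edge list JSON response (crude parse,
# assumes the standard pagination envelope in the response metadata).
terminator_count() {
  sed -n 's/.*"totalCount":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Example (polls every 10s):
#   while true; do
#     ziti edge list terminators 'limit 1' -j | terminator_count
#     sleep 10
#   done
```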


I tested with about 20k clients, each with a single terminator. What I found was that it was an auth thundering herd problem.

It's especially dire for test setups where every identity is authenticating for the first time.

If the identity's env info has changed, auth will update the identity.
For various reasons, the identity's authenticator will also generally be updated.
Both of these writes cause authentication to be bottlenecked, waiting for the writes to finish.

A related issue is that we have a TLS handshake rate limiter, but it's disabled by default. With it disabled, the system can become CPU bound, and handshakes can start timing out, causing clients to retry and putting yet more load on the system.

My test setup also had two issues that you probably don't have:

  1. The Go SDK will check to see if it can use OIDC by checking the /versions API. If this times out or is rejected because the TLS handshake timed out, the SDK will fall back to using legacy auth/sessions. This further increases the load on the system because it will be persisting sessions.
  2. I created all of my identities on a single controller, to speed things up, then expanded my cluster after the model was in place. This sped up model creation, but it meant all my identities only had one controller in the configuration. All the auth load was then concentrated on that single controller.

After I enabled the TLS handshake rate limiter, made the identity and authenticator writes happen in the background with size-limited queues, and fixed the versions check timeout, I was able to get the 20k terminators created in a reasonable amount of time.

I'm working on making the following changes:

  1. Enable the TLS handshake rate limiter by default
  2. Andrew is looking at making changes to enrollment, so that the authenticator doesn't need any updates on initial auth
  3. Background the identity SDK env-info writes, optionally allowing them to be dropped if the system is too busy
  4. Fix the Go SDK /versions check: loop until it succeeds, with some waits in between. This matches the C SDK behavior.
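
The retry behavior in item 4 might look something like the sketch below; the timeout, backoff cap, and attempt limit are placeholders (per the above, the C SDK simply retries until it succeeds):

```shell
# Poll the controller's /versions endpoint until it answers, with capped
# exponential backoff. attempts=0 means retry forever (the C SDK behavior).
versions_check() {
  local url="$1" attempts="${2:-0}" delay=1 i=0
  while :; do
    if curl -skf --max-time 5 "${url}/versions" > /dev/null; then
      return 0
    fi
    i=$(( i + 1 ))
    if (( attempts > 0 && i >= attempts )); then
      return 1
    fi
    sleep "${delay}"
    if (( delay < 30 )); then
      delay=$(( delay * 2 ))
    fi
  done
}
```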

Cheers,
Paul


Thanks again, @plorenz.

Sounds like you’ve made some excellent progress on this.

When I saw the identity and authenticator were getting updated, it did occur to me that it wouldn't exactly be helping the situation.

The changes you suggested seem logical, so I will stand by while you work on them.

Many thanks.