Terminator creation performance degradation

Hi Again,

I have an HA system which consists of 3 Controllers and 10 Edge Routers on v1.6.8. I've been running some load tests to understand how the system handles high levels of terminator creation, which could happen in a scenario where many ZET clients are brought online at roughly the same time. I have 40k identities enrolled in the system, with policies that spread them all evenly over the available routers.

In my test I bring all the ZET clients up over a period of 1 hour. The ZET clients have a polling interval of 1800s (30m). What I notice is that as more terminators exist on the system, the time to create a terminator increases exponentially. For example, the first 10k terminators get created almost as soon as the ZET clients come online, the next 10k might take 1 hour, but the final 20k will take 5 hours. If terminator creation per minute were charted on a graph, it would be seen as a logarithmic curve.
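
For reference, a simple way to sample that creation rate, assuming the ziti CLI is logged in to the controller; the total is scraped from the trailing "results: x-y of N" line of ziti fabric list terminators, so treat this as a rough sketch rather than anything exact:

#!/usr/bin/env bash
# log the total terminator count once a minute so the creation rate can be charted;
# the count comes from the final "results: x-y of N" line of the CLI output
while true; do
  total=$(ziti fabric list terminators | tail -n 1 | awk '{print $NF}')
  echo "$(date -u +%FT%TZ) terminators=${total}"
  sleep 60
done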

I understand that the bbolt database allows only one read-write transaction at a time, and that therefore only the Raft leader can apply changes to the data model, sequentially. But I'm not sure this entirely explains why performance degrades in this way.

I have tried various combinations of more/fewer controllers and routers. Increasing VM resources obviously does improve the results, but CPU/memory utilisation is always nominal, as is IO. I've adjusted the rate limiters and queues, but there is always this slowdown in creating terminators and other entities.

During periods when high levels of terminator creation are required, I do notice many occurrences of UnknownTerminator in the Controller logs, which triggers the terminator to be deleted. I am sure this is slowing things down, but I'm not sure it is the core reason, as the numbers don't correlate.

{"file":"github.com/openziti/ziti/controller/network/router_messaging.go:312","func":"github.com/openziti/ziti/controller/network.(*RouterMessaging).sendTerminatorValidationRequest","level":"info","msg":"queuing validate of terminator","terminatorId":"6OSD6UWKsv3zhMVO8Vcxt2","time":"2025-10-09T10:33:20.015Z"}
{"file":"github.com/openziti/ziti/controller/network/router_messaging.go:594","func":"github.com/openziti/ziti/controller/network.(*terminatorValidationRespReceived).handle","level":"info","msg":"terminator not valid, queuing terminator delete","reason":"UnknownTerminator","terminatorId":"6OSD6UWKsv3zhMVO8Vcxt2","time":"2025-10-09T10:33:20.016Z"}

Hi @farmhouse ,

Thank you for reporting this. My scale testing for terminator creation has usually been in the range of 5k-10k terminators, so I haven't hit this myself.

I agree that the single writer shouldn't cause degradation of that severity. There's also some amount of exponential back-off happening, but that should cap at a reasonable level.

I'll have to run some tests and see what I can find. I created an issue to make sure we don't lose track of this, as I likely won't be able to get to the testing immediately: Terminator creation seems to slow exponentially as the number of terminators rises from 10k to 20k to 40k · Issue #3318 · openziti/ziti · GitHub

Thanks @plorenz. Do let me know if I can provide any help investigating or testing further.

@plorenz Hope you're well. Don't suppose you've had a chance to look at this?

Or any other thoughts on avoiding it altogether?

Hello. I haven't had a chance to dig into this much. I'm spinning up a test now with 30k terminators to see if I can duplicate the issue. If the problem exists in the router or controller, I should hopefully be able to pinpoint it.

I did an initial run without HA, and the 30k terminators were created relatively quickly, so it's possible it's just related to slowdowns with raft. If that's the case, I do have a couple of ideas on how that might be sped up that I can test out.

I'll let you know what I find,
Paul

Hi @farmhouse

I think I may have made some progress. I found and fixed a couple of issues:

  1. When doing rate limiting on the raft commands, the rate limiter was waiting for the command to be submitted and also waiting for the response to come back. I tried using the rate limiter only on command submission and then using a separate semaphore to limit how many commands could be in flight at once. This seemed to speed things up considerably, as I was unwittingly preventing any raft batching from taking place (there's a rough shell analogy of this after the list).
  2. If the leader became unresponsive due to load, create terminator commands would start flowing through other controllers. If these timed out getting a response from the leader, the terminator would be deleted and the operation would have to start from scratch. I think this is what you were seeing in your setup. I fixed it so that a timeout is interpreted as 'controller too busy', which just affects the router side rate limiting.
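
To make the first point a bit more concrete, here's a loose shell analogy rather than the actual controller code; submit_and_wait is just a stand-in for "apply one model change and wait for its response". The difference is between waiting for every response before submitting the next command, and only bounding how many commands are outstanding, which leaves raft something to batch:

# stand-in for submitting a command and waiting for its response
submit_and_wait() { sleep 0.2; }
export -f submit_and_wait

# before: effectively serial, one command at a time, so nothing to batch (~10s)
time for i in $(seq 1 50); do submit_and_wait "$i"; done

# after: submissions still bounded, but up to 16 commands in flight at once (~1s)
time seq 1 50 | xargs -P 16 -I{} bash -c 'submit_and_wait "$@"' _ {}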

Without those fixes, if I didn't hit the second issue, it was doing about 12 creates per second, so getting to 30k terminators would be about 40 minutes. If I did hit the second issue, it slowed way down and I didn't bother waiting for it to finish. I imagine it would have been at least a few hours for it to complete.

With both fixes, it takes 4-8 minutes to complete.

As the fix should improve throughput for all model mutations, I'm pretty happy to have found this. I still need to review it with the team; I'll let you know if any problems crop up.

Let me know if you have any interest in testing a pre-release, to ensure that your problem is solved before we do a release with the fix.

Thank you,
Paul

@plorenz Thanks so much for investigating this.

Those issues you've identified sound like what I experienced.

I can certainly help test a pre-release. I'm running the infrastructure components as containers nowadays.

I'll send you a note and a link when a pre-release is available. It'll probably be a few more days. I've found a few more bugs while I was testing so I'm fixing those and doing some more testing and validation while I'm in there.

Paul

Hiya @plorenz. Hope you're well. Any news on this? I notice you made some fixes that seem to have made their way into the 1.8.0-pre3 pre-release.

Yes! Sorry, forgot to ping you. If you want to give 1.8.0-pre3 a try, you should hopefully see better performance.

I still need to do another pass on the Go SDK, as my testing shows that we don't always return to the expected number of terminators when there's a lot of system churn. We often end up with a couple of extra terminators (if there are available routers). I know what the issue is, but I've got a few other high priority items to get through before I get to that. Since the number of terminators doesn't exceed the number of available edge routers, it's not a critical bug.

Let me know how the changes work for you.
Thank you,
Paul

Thanks.

Would you recommend I go with the defaults you've set for the raft.ratelimiter parameters you describe here? Perhaps I should increase the maxSize?

And what should I do with the existing commandRateLimiter and cluster.commandHandler parameters?

Here's my existing Controller config; I'd appreciate it if you could take a look.

v: 3

commandRateLimiter:
  enabled: true
  maxQueued: 300

tls:
  handshakeTimeout: 30s
  rateLimiter:
    enabled: true
    minSize: 100
    maxSize: 1000

cluster:
  dataDir: /etc/ziti/config
  trailingLogs: 2500
  snapshotInterval: 3m
  snapshotThreshold: 250
  maxAppendEntries: 1000
  minClusterSize: 1
  commandHandler:
    maxQueueSize: 1000

identity:
  cert: /etc/ziti/config/pki/server.chain.pem
  key: /etc/ziti/config/pki/server.key
  ca: /etc/ziti/config/pki/ca-bundle.pem

ctrl:
  listener: tls:0.0.0.0:7443
  options:
    advertiseAddress: tls:ctrl-1:7443
    maxQueuedConnects: 5000
    maxOutstandingConnects: 1000
    connectTimeoutMs: 2000
    writeTimeout: 15s

events:
  stdoutLogger:
    subscriptions:
      - type: entityCount
        interval: 60s
      - type: cluster
      - type: router
    handler:
      type: stdout
      format: json

edge:
  api:
    sessionTimeout: 45m
    address: ctrl-1.edge.example.com:443
  enrollment:
    signingCert:
      cert: /etc/ziti/config/pki/intermediate.crt
      key: /etc/ziti/config/pki/intermediate.key
    edgeIdentity:
      duration: 10m
    edgeRouter:
      duration: 10m

web:
  - name: client-api
    bindPoints:
      - interface: 0.0.0.0:8443
        address: ctrl-1.edge.example.com:443
    options:
      minTLSVersion: TLS1.2
      maxTLSVersion: TLS1.3
      idleTimeout: 5000ms
      readTimeout: 5000ms
      writeTimeout: 100000ms
    apis:
      - binding: edge-client
      - binding: edge-oidc

  - name: mgmt-api
    identity:
      cert: /etc/ziti/config/pki/client.crt
      key: /etc/ziti/config/pki/client.key
      server_cert: /etc/ziti/config/pki/server.chain.pem
      server_key: /etc/ziti/config/pki/server.key
      ca: /etc/ziti/config/pki/ca-bundle.pem
      alt_server_certs:
        - server_cert: /etc/ziti/config/pki/dashboard.crt
          server_key: /etc/ziti/config/pki/dashboard.key
        - server_cert: /etc/ziti/config/pki/mgmt.crt
          server_key: /etc/ziti/config/pki/mgmt.key
    bindPoints:
      - interface: 0.0.0.0:10443
        address: ctrl-1:10443
      - interface: 0.0.0.0:9443
        address: ctrl-1.mgmt.example.com:443
    apis:
      - binding: health-checks
      - binding: fabric
      - binding: edge-oidc
      - binding: edge-management
      - binding: metrics
        options: {
          includeTimestamps: true
        }
      - binding: zac
        options:
          location: /ziti-console
          indexFile: index.html

  1. I would start with the defaults and only change them if you run into trouble. Because it's adaptive, the window has a wide range, and it should settle somewhere that works for your setup. Additionally, the actual batching is controlled by raft parameters, so changing these settings might need to be done in tandem with changing raft's batch settings. If you do experiment with these settings, I'd be curious to hear the results.
  2. The commandRateLimiter is only used for non-HA setups now, so you can remove it.
  3. cluster.commandHandler.maxQueueSize is used to determine the size of the queue on the handler in the current controller accepting model updates from other controllers. If the queue fills up, incoming requests will be rejected to provide back-pressure. 1000 seems like a reasonable value. Smaller is probably ok too.

Hope that's helpful,
Paul

Thanks again. To be clear, when you say…

Are you talking about some other settings I can modify?

See: Controller Configuration Reference | NetFoundry Documentation

The maxAppendEntries sets the max raft batch size.

Hi @plorenz. I've had a chance to give v1.8.0-pre3 a good test now.

I am still experiencing the exponential increase in the time it takes to create terminators.

I certainly no longer experience the 'terminator not valid, queuing terminator delete' issue from the older versions, and it does indeed appear that terminators no longer get deleted during floods of terminator creation; the creations just take exponentially longer for some reason.

My infrastructure components now consist of 3 controllers and 12 routers, with 30k clients spread evenly across the routers.

Once the terminator count is above ~10k, I start seeing frequent SERVER_TOO_MANY_REQUESTS and 'server too busy' log messages. At ~15k, terminator creation was happening at roughly 1 per second and I realised it was going to take a very long time.

I see some unfamiliar log entries that I assume are associated with the changes in this version. For every created terminator I see a log entry of 2025/12/11 15:30:48 Rollback failed: tx closed

I've attached a log snippet taken during mass terminator creation. The terminator count is about 3300 at this stage. I wonder if anything in there stands out to you.

log.zip (95.5 KB)

Thank you for pointing out the 2025/12/11 15:30:48 Rollback failed: tx closed messages. I think that's due to updating the raft boltdb library to an unstable version. That's been fixed.

In my testing I was seeing that 30k terminators would take around 7 minutes to create. However, that was with the Go SDK on the client side. What are you using for hosting? I'm wondering if there's a difference in client behavior, maybe changes to exponential back-off or something similar, that is causing the issue.

I should be able to get back to testing on this in the next couple of weeks, as I've finished the project that was taking my focus.

Let me know,
Thank you,
Paul

Thanks @plorenz.

I'm currently running the Controllers and Routers in a three-node AKS cluster using Standard_D8ads_v5 nodes with Premium v2 storage. I also use a local cluster for testing, which consists of 3 Intel N100 'mini PCs'. I've created my own Helm charts to run the Controller and Router workloads as separate StatefulSets: 1 Controller per node, 12 Routers spread evenly across the nodes. Happy to share the charts; they're inspired by your non-HA ones.

I run the ZET clients on separate VMs. Usually about 10k per VM.

I'm using third-party CA auto-enrolment, which seems to work as expected, although the Controller does log 'certificate signed by unknown authority' when each client connects.

Hi @farmhouse, I reran the test today, both with the Go SDK (ziti tunnel) and the C SDK (ziti-edge-tunnel). Both of them were able to create the 30k terminators in about 3 minutes:

[  15.248]    INFO fablab/cmd/fablab/subcmd.(*startAction).run: 200 components started

$ fablab exec validate
[   7.154]    INFO main.validateTerminatorsForCtrl: {ctrl=[ctrl2]} current terminator count: 770, elapsed time: 6.133213627s
[   9.120]    INFO main.validateTerminatorsForCtrl: {ctrl=[ctrl3]} current terminator count: 770, elapsed time: 7.154591343s
[  10.837]    INFO main.validateTerminatorsForCtrl: {ctrl=[ctrl1]} current terminator count: 701, elapsed time: 8.217412861s
[  68.783]    INFO main.validateTerminatorsForCtrl: {ctrl=[ctrl2]} current terminator count: 10868, elapsed time: 1m7.7620659s
[  71.279]    INFO main.validateTerminatorsForCtrl: {ctrl=[ctrl1]} current terminator count: 11252, elapsed time: 1m8.659402461s
[  73.539]    INFO main.validateTerminatorsForCtrl: {ctrl=[ctrl3]} current terminator count: 11556, elapsed time: 1m11.573863463s
[ 129.839]    INFO main.validateTerminatorsForCtrl: {ctrl=[ctrl2]} current terminator count: 22501, elapsed time: 2m8.817993024s
[ 131.781]    INFO main.validateTerminatorsForCtrl: {ctrl=[ctrl1]} current terminator count: 22850, elapsed time: 2m9.161210647s
[ 137.935]    INFO main.validateTerminatorsForCtrl: {ctrl=[ctrl3]} current terminator count: 23812, elapsed time: 2m15.969907928s
[ 166.536]    INFO main.validateTerminatorsForCtrl: {ctrl=[ctrl2]} all terminators present, elapsed time: 2m45.514425696s
[ 166.790]    INFO main.validateTerminatorsForCtrl: {ctrl=[ctrl3]} all terminators present, elapsed time: 2m44.824358149s
[ 166.961]    INFO main.validateRouterSdkTerminators: {ctrl=[ctrl2]} started validation of 6 routers
[ 167.638]    INFO main.validateRouterSdkTerminators: {ctrl=[ctrl3]} started validation of 6 routers
[ 170.138]    INFO main.validateRouterSdkTerminators: {ctrl=[ctrl2]} sdk terminator validation of 6 routers successful
[ 170.419]    INFO main.validateTerminatorsForCtrl: {ctrl=[ctrl1]} all terminators present, elapsed time: 2m47.798881057s
[ 170.646]    INFO main.validateRouterSdkTerminators: {ctrl=[ctrl1]} started validation of 6 routers
[ 174.176]    INFO main.validateRouterSdkTerminators: {ctrl=[ctrl3]} sdk terminator validation of 6 routers successful
[ 176.012]    INFO main.validateRouterSdkTerminators: {ctrl=[ctrl1]} sdk terminator validation of 6 routers successful

$ ziti fabric list terminators 
╭────────────────────────┬──────────────┬─────────────┬─────────┬────────────────────────┬──────────┬──────┬────────────┬──────────────┬────────────╮
│ ID                     │ SERVICE      │ ROUTER      │ BINDING │ ADDRESS                │ INSTANCE │ COST │ PRECEDENCE │ DYNAMIC COST │ HOST ID    │
├────────────────────────┼──────────────┼─────────────┼─────────┼────────────────────────┼──────────┼──────┼────────────┼──────────────┼────────────┤
│ 100GIgy15OvYEfJZkzCB7r │ service-1310 │ router-ap-0 │ edge    │ 100GIgy15OvYEfJZkzCB7r │          │    0 │ default    │            0 │ cvcmWZw3XL │
│ 100nLc1kTviceIfjsdCeSA │ service-1596 │ router-us-0 │ edge    │ 100nLc1kTviceIfjsdCeSA │          │    0 │ default    │            0 │ hXc0OZT3v  │
│ 101YKw35pbxLy1l5MPZJ10 │ service-1214 │ router-eu-1 │ edge    │ 101YKw35pbxLy1l5MPZJ10 │          │    0 │ default    │            0 │ EJUmOZTKX  │
│ 101eo06A5z6cxNYBG0pn9o │ service-0006 │ router-ap-0 │ edge    │ 101eo06A5z6cxNYBG0pn9o │          │    0 │ default    │            0 │ J9x0OPwKXL │
│ 101lvINvbhxejM33rWxz9H │ service-1257 │ router-eu-1 │ edge    │ 101lvINvbhxejM33rWxz9H │          │    0 │ default    │            0 │ .vN5OZw3X  │
│ 104OD62oH0HG42oWQXqBv2 │ service-1367 │ router-eu-1 │ edge    │ 104OD62oH0HG42oWQXqBv2 │          │    0 │ default    │            0 │ nsmhWPTKX  │
│ 104vikQl0ojuMzM7woGhvZ │ service-1758 │ router-ap-0 │ edge    │ 104vikQl0ojuMzM7woGhvZ │          │    0 │ default    │            0 │ XP40OPTKXL │
│ 104vkV5l5IrRJJNY1dVrNB │ service-0668 │ router-us-0 │ edge    │ 104vkV5l5IrRJJNY1dVrNB │          │    0 │ default    │            0 │ gbYmOZTKX  │
│ 1065uI3ohhb1sKcNMUkjMO │ service-1275 │ router-eu-1 │ edge    │ 1065uI3ohhb1sKcNMUkjMO │          │    0 │ default    │            0 │ AvnmOPwKX  │
│ 106R1AFEVCtktHlG9hdspr │ service-0622 │ router-eu-1 │ edge    │ 106R1AFEVCtktHlG9hdspr │          │    0 │ default    │            0 │ IHN5OPwKvl │
╰────────────────────────┴──────────────┴─────────────┴─────────┴────────────────────────┴──────────┴──────┴────────────┴──────────────┴────────────╯
results: 1-10 of 30000

My test setup is more spread out. It looks like:

$ fablab list hosts
┌────┬─────────────┬────────────────────────┬────────────────┬──────────────┬───────────────┐
│  # │ ID          │ COMPONENTS             │ REGION         │ INSTANCETYPE │ TAGS          │
├────┼─────────────┼────────────────────────┼────────────────┼──────────────┼───────────────┤
│  1 │ ctrl3       │ ziti-controller:    1  │ ap-southeast-2 │ c5.2xlarge   │               │
│  2 │ host-ap-0   │ ziti-edge-tunnel:   10 │ ap-southeast-2 │ c5.xlarge    │ host,scaled   │
│  3 │ host-ap-1   │ ziti-edge-tunnel:   10 │ ap-southeast-2 │ c5.xlarge    │ host,scaled   │
│  4 │ host-ap-2   │ ziti-edge-tunnel:   10 │ ap-southeast-2 │ c5.xlarge    │ host,scaled   │
│  5 │ host-ap-3   │ ziti-edge-tunnel:   10 │ ap-southeast-2 │ c5.xlarge    │ host,scaled   │
│  6 │ host-ap-4   │ ziti-edge-tunnel:   10 │ ap-southeast-2 │ c5.xlarge    │ host,scaled   │
│  7 │ host-ap-5   │ ziti-edge-tunnel:   10 │ ap-southeast-2 │ c5.xlarge    │ host,scaled   │
│  8 │ router-ap-0 │ ziti-router:    1      │ ap-southeast-2 │ c5.xlarge    │ router,scaled │
│  9 │ router-ap-1 │ ziti-router:    1      │ ap-southeast-2 │ c5.xlarge    │ router,scaled │
│ 10 │ ctrl2       │ ziti-controller:    1  │ eu-west-2      │ c5.2xlarge   │               │
│ 11 │ host-eu-0   │ ziti-edge-tunnel:   10 │ eu-west-2      │ c5.xlarge    │ host          │
│ 12 │ host-eu-1   │ ziti-edge-tunnel:   10 │ eu-west-2      │ c5.xlarge    │ host          │
│ 13 │ host-eu-2   │ ziti-edge-tunnel:   10 │ eu-west-2      │ c5.xlarge    │ host          │
│ 14 │ host-eu-3   │ ziti-edge-tunnel:   10 │ eu-west-2      │ c5.xlarge    │ host          │
│ 15 │ host-eu-4   │ ziti-edge-tunnel:   10 │ eu-west-2      │ c5.xlarge    │ host          │
│ 16 │ host-eu-5   │ ziti-edge-tunnel:   10 │ eu-west-2      │ c5.xlarge    │ host          │
│ 17 │ router-eu-0 │ ziti-router:    1      │ eu-west-2      │ c5.xlarge    │ router        │
│ 18 │ router-eu-1 │ ziti-router:    1      │ eu-west-2      │ c5.xlarge    │ router        │
│ 19 │ ctrl1       │ ziti-controller:    1  │ us-east-1      │ c5.2xlarge   │               │
│ 20 │ host-us-0   │ ziti-edge-tunnel:   10 │ us-east-1      │ c5.xlarge    │ host          │
│ 21 │ host-us-1   │ ziti-edge-tunnel:   10 │ us-east-1      │ c5.xlarge    │ host          │
│ 22 │ host-us-2   │ ziti-edge-tunnel:   10 │ us-east-1      │ c5.xlarge    │ host          │
│ 23 │ host-us-3   │ ziti-edge-tunnel:   10 │ us-east-1      │ c5.xlarge    │ host          │
│ 24 │ host-us-4   │ ziti-edge-tunnel:   10 │ us-east-1      │ c5.xlarge    │ host          │
│ 25 │ host-us-5   │ ziti-edge-tunnel:   10 │ us-east-1      │ c5.xlarge    │ host          │
│ 26 │ host-us-6   │ ziti-edge-tunnel:   10 │ us-east-1      │ c5.xlarge    │ host          │
│ 27 │ host-us-7   │ ziti-edge-tunnel:   10 │ us-east-1      │ c5.xlarge    │ host          │
│ 28 │ router-us-0 │ ziti-router:    1      │ us-east-1      │ c5.xlarge    │ router        │
│ 29 │ router-us-1 │ ziti-router:    1      │ us-east-1      │ c5.xlarge    │ router        │
└────┴─────────────┴────────────────────────┴────────────────┴──────────────┴───────────────┘

Each hosting tunneler hosts 50 services, with three terminators per hosted service.
So 200 hosting tunnelers * 50 services * 3 terminators per service = 30000 terminators

  • 3 controllers, one in each region
  • 6 routers, two in each region
  • 20 host machines, 10 tunnelers per host machine

Before I made the changes in 1.8.0-pre3, I was seeing much longer times to establish; because of churn and a bottlenecked raft pipeline, it might take hours.

I'm wondering if you're hitting some sort of resource constraint. We have seen issues where the controller suddenly slows down because it uses up its IOPS quota. I'm not up on Azure terminology, but I'm guessing premium storage would be less likely to hit a resource block. Also, my test controllers and routers look like they're lower spec than what you have.

10k clients per VM also seems like a lot. Are you doing one service per client?

You could try grabbing some stack dumps from the controllers and routers, as well as CPU pprofs. There is some doc on how to do that here: How To Gather OpenZiti Diagnostics · openziti/ziti Wiki · GitHub
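
For the stack dumps, something like this on each controller and router host while the slowdown is happening should be enough to start with; I'm quoting the agent command from memory, so check the wiki page above for the exact invocation and for the CPU profile steps:

# capture a goroutine stack dump from the ziti process running on this host
# (the output filename is just an example)
ziti agent stack > ctrl1-stack-$(date +%s).txt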

My test is also purely focused on SDK terminator creation. It's possible that in your live system, there may be other activity that's competing for resources and slowing things down.

Paul

Thanks again @plorenz .

Your client setup is slightly different to mine. I wonder if it's making the difference.

My use case requires that I have many individual clients, each only needing to host a single service (ssh). I don't really need the redundancy that multiple terminators per service gives, so I have maxConnections set to 1 in my host.v1 config. I think I'll try a test with maxConnections set to 2 or 3, run fewer clients, and see how that affects things; something like the config sketch below.
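
What I have in mind for that test is just my existing host.v1 config recreated under a throwaway name with maxConnections raised (the config name here is only for the test):

ziti edge create config ssh.cfg.host.test host.v1 '{
    "address": "127.0.0.1",
    "protocol": "tcp",
    "port": 22,
    "listenOptions": {
      "identity": "$tunneler_id.name",
      "maxConnections": 3
    }
}'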

So I think I'm correct in saying that I've got 1 service per client, which therefore results in 1 terminator per client.

So far my load tests consist of me running X thousand instances of ziti-edge-tunnel in the background. I've attached the actual script I use to run this test. I stop at 10k instances as this seems to consume about 16GB of RAM.

zet_mass_test.sh.txt (4.4 KB)

I also see similar characteristics when I disconnect clients. For example, if I've got 20k clients online and therefore 20k terminators, terminator deletion per second is relatively low initially and gets higher as there are fewer terminators.

The Azure Premium v2 storage allows configurable performance. I think I'm at 4000 IOPS and 1000MB/s R/W throughput at the moment.

Are you running your component infrastructure on VMs? Not Docker or other containers?

Good morning @plorenz .

I have been able to reproduce the terminator creation performance you experience by adjusting my configuration to more closely match yours.

With 50 services, 3 terminators per service, and 200 ZET clients (200 clients * 50 services * 3 terminators = 30k), this gets me to 30k terminators in ~4 mins on my mini PC cluster. Unfortunately this does not match my intended use case, which requires me to have many clients with only 1 service per client.

Subsequently I have modified my configuration to try to provide for the high client count that I need. I've used identity roles to place identities in logical groups of 1000. Each group of 1000 has 1 service and 1 Bind service-policy associated with it; see my script below for doing this. I experience better terminator creation performance with this logical setup; however, I now have a problem with dialing these services. When I try to dial a service, e.g. ssh user@<identity.name>.example.device, I get a service w4nJZTZfqcQmJwdhF5xOR has no terminators error in my controller logs.

ZITI_IDENTITY_DOMAIN="example.device"

# intercept config shared by all ssh services; dialOptions.identity uses the
# intercepted destination hostname to select a specific terminator
ziti edge create config ssh.cfg.intercept intercept.v1 "{
    \"addresses\": [\"*.${ZITI_IDENTITY_DOMAIN}\"],
    \"protocols\": [\"tcp\"],
    \"portRanges\": [{\"low\":22,\"high\":22}],
    \"dialOptions\": {\"identity\": \"\$dst_hostname\"}
}"

# host config shared by all ssh services; listenOptions.identity publishes each
# terminator under the hosting identity's name
ziti edge create config ssh.cfg.host host.v1 '{
    "address": "127.0.0.1",
    "protocol": "tcp",
    "port": 22,
    "listenOptions": {
      "identity": "$tunneler_id.name",
      "maxConnections": 1
    }
}'

# one service and one Bind service-policy per group of 1000 identities
# (the identities in group i carry the ctrl-group-i role attribute)
for ((i = 1; i <= 50; i++)); do

  ziti edge create service ssh.${i} \
    --configs ssh.cfg.intercept,ssh.cfg.host \
    --role-attributes admin,ctrl-group-${i}

  ziti edge create service-policy ssh.bind.${i} Bind --identity-roles "#ctrl-group-${i}" --service-roles "@ssh.${i}"

done

ziti edge create service-policy ssh.dial Dial --identity-roles "#admin" --service-roles "@ssh.1,@ssh.2,@ssh.3,@ssh.4,@ssh.5,@ssh.6,@ssh.7,@ssh.8,@ssh.9,@ssh.10,@ssh.11,@ssh.12,@ssh.13,@ssh.14,@ssh.15,@ssh.16,@ssh.17,@ssh.18,@ssh.19,@ssh.20,@ssh.21,@ssh.22,@ssh.23,@ssh.24,@ssh.25,@ssh.26,@ssh.27,@ssh.28,@ssh.29,@ssh.30,@ssh.31,@ssh.32,@ssh.33,@ssh.34,@ssh.35,@ssh.36,@ssh.37,@ssh.38,@ssh.39,@ssh.40,@ssh.41,@ssh.42,@ssh.43,@ssh.44,@ssh.45,@ssh.46,@ssh.47,@ssh.48,@ssh.49,@ssh.50"
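
For reference, a quick way I've been checking whether the bind side has actually published a terminator under the identity name being dialed (the identity name below is made up, and the limit clause is just to page past the default 10 results):

# is there a terminator whose instance matches this identity name?
ziti fabric list terminators 'limit none' | grep device-0001

# and from an identity covered by the ssh.dial policy:
ssh user@device-0001.example.device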