High load during creation and enrolment of new identities

Hi again, I've been doing some load testing on my HA Ziti system (v1.5.4).

Creating and enrolling many identities concurrently results in high load on the HA Controller that's doing the work.

I have three HA Controllers in my network. The client traffic is spread relatively evenly between them, resulting in relatively even system load. When identity creation and enrolment happens, it significantly, albeit temporarily, increases load on the Controller that was used to create the identity.

I was hoping that I could avoid load spikes on individual Controllers by adding another Controller to the network which only provides the services necessary for creating and enrolling identities. However, when I remove items from the web.apis config, I get errors.

Is it possible to do what I would like, or is there something else that can be recommended?

What kind of errors do you get? Errors on the controller or from SDK applications?

Thanks @plorenz.

I was hoping to have a dedicated enrolment Controller which only provides the necessary API endpoints for identity creation and enrolment.

I realise now that the edge-client API provides the endpoints for client/router enrolment, so I shouldn't disable it.

Perhaps you can recommend a solution to my specific problem.

Sometimes I need to enrol up to 250 devices in a short space of time. This significantly increases load on the Controller doing the JWT enrolment. Is there some kind of rate-limiting functionality I can implement to directly limit the enrolments?

I see there are built-in rate limiters, but they don't seem to specifically affect enrolments.

Otherwise, is it possible to somehow configure a Controller to only provide API services for enrolment?

I'm not sure there's a ready solution. There are two built-in rate limiters, one for API sessions and one for model mutations.

The first one is meant to handle the stampeding herd issue when a bunch of SDKs come online at the same time. It should be less relevant for people using OIDC auth, since that doesn't need to write to the DB.

The second is mostly aimed at terminator creation, since there can be a similar stampeding herd issue there.

Because enrollments are usually done from the target device, it makes it hard to limit the controller functionality. I'll check with @andrew.martinez and see if he has any thoughts, since he spends a lot more time thinking about enrollment than I do.

Paul


If you use the built-in enrollment that generates certificates from the internal OpenZiti PKI, there is computational load to sign the certificates being issued. These are short but high CPU usage activities. Without looking at a pprof for your controller, I would guess that is what is causing the load. Unfortunately, there is no way to avoid that cost if OpenZiti is to issue the certificates.

There are ways to move that CPU cost (3rd Party CAs), but doing so adds complexity in PKI management, which I don't suggest for non-enterprise situations where that infrastructure isn't already built and managed.

Rate-limiting enrollment requests introduces a few interesting new behavior patterns and options. For example, we could rate-limit the number of concurrent signing processes via a worker pool. Depending on the hardware and number of requests, this could cause some requests to time out or be served a `429 Too Many Requests`. At that point, existing routers, SDKs, and CLIs will error, as they don't know how to deal with that response. So any automation standing those up will have to decide how and when to retry.
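To make the retry decision concrete, here is a minimal sketch of what provisioning automation could do if enrollment were ever rate-limited. It is not part of any OpenZiti SDK or CLI: `enroll`, `TooManyRequests`, and all parameter names are hypothetical stand-ins for whatever your tooling uses to run an enrollment and detect a 429 response.

```python
import random
import time


class TooManyRequests(Exception):
    """Hypothetical: raised by the enroll callable when the controller answers 429."""


def enroll_with_retry(enroll, max_attempts=6, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call enroll() until it succeeds, backing off after each 429.

    `enroll` is any zero-argument callable that performs one enrollment
    attempt (an assumption for this sketch, not a real OpenZiti API).
    """
    for attempt in range(max_attempts):
        try:
            return enroll()
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller decide what to do
            # Exponential backoff capped at max_delay, with full jitter so a
            # fleet of devices does not retry in lockstep.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(cap * rng())
```

The jitter matters when many devices are enrolled at once: without it, every device that was rejected in the same instant would come back in the same instant.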

Another, cruder option is to simply limit the number of outstanding HTTP requests, which is similar but not as specific as the above.

At this point, I am leaning towards adding a signing worker pool that can be configured both in terms of the number of workers and timeout.
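For anyone following along, the idea can be sketched roughly as below. This is only an illustration of the concept, not the controller's implementation (the controller is written in Go, and nothing here exists in it today). It uses a semaphore to bound concurrent signing rather than dedicated worker threads; `SigningPool`, `workers`, `timeout`, and `sign` are all hypothetical names.

```python
import threading


class SigningPool:
    """Allow at most `workers` concurrent signing calls.

    A request waits up to `timeout` seconds for a free slot and is
    rejected with a 429-style status if none becomes available.
    """

    def __init__(self, workers, timeout):
        self._slots = threading.BoundedSemaphore(workers)
        self._timeout = timeout

    def handle(self, sign, csr):
        # Wait for a free slot; give up once the timeout elapses.
        if not self._slots.acquire(timeout=self._timeout):
            return 429, None
        try:
            # The CPU-heavy part: only `workers` of these run at once.
            return 200, sign(csr)
        finally:
            self._slots.release()
```

With `workers=1` and a short timeout, a second request arriving while the first is still signing gets `(429, None)`, which is the case where the caller would need to retry.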

The end result would be less controller load during high-enrollment windows, but devices might have to retry for longer, and the lower load would be sustained over a longer window.
