High availability architecture for routers and controller

We have one controller that serves all private routers deployed on users' clusters. To have an entrypoint, as discussed with @qrkourier, we deploy a minimum of one public router somewhere.

We are not sure how much the number of users will grow, and each user can have many clusters, so won't everyone overwhelm the public router and controller at a certain point? Does deploying more public routers distribute the traffic?

How do we measure this and choose an appropriate architecture and resource allocation so we can deploy the OpenZiti system without any downtime?

Let's start with router topologies, because the multiple-controller HA feature hasn't reached maturity.

A good heuristic for public router placement is two per geographic region. This gives you a minimal balance of reliability and performance.

Keep in mind that all routers, private and public, will call all public routers (routers with link listeners) to form links. Each router link provides bi-directional fabric transport for services. This is the mesh backbone of the Ziti network. Ziti will automatically find the best path through the mesh, so you only need to provide sufficient points of presence for the mesh to perform well.

Private routers can help with performance too because they have direct access to public routers. For example, you can place a private router inside a cluster with an edge listener like ziti-router-edge.zitinamespace.svc.cluster.local and grant permission to the identities inside the same cluster with an edge router policy. Those identities will try all routers they have permission to use and pick the fastest one to respond.
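
If it helps, here is a minimal sketch of such a policy using the ziti CLI. The policy name and the #cluster-a, #cluster-a-private, and #public role attributes are hypothetical; substitute the router and identity role attributes from your own deployment:

```
# Allow identities tagged #cluster-a to use the in-cluster private router
# (tagged #cluster-a-private) in addition to the public routers (tagged #public).
ziti edge create edge-router-policy cluster-a-local \
  --edge-router-roles '#cluster-a-private,#public' \
  --identity-roles '#cluster-a'
```

Depending on your setup, the services those identities dial may also need to be permitted on the same routers by a service edge router policy.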

Thanks @qrkourier. Any idea how the router and controller pods would react under heavy load, in terms of connection-handling capacity, CPU, memory, etc.? What metric changes can I expect for a particular situation?

Are two public routers per geographic region enough, or should there be a formula, something like public routers = (private routers / 10)?

It is a very nuanced subject. Fundamentally, CPU utilization is directly proportional to packets per second, independent of packet size, and to a lesser degree to circuit setup counts. In our commercial offerings we see a lot of different models: high throughput with low connection counts, low throughput with high connection counts, and everything in between. The capacity of these networks depends on these models. To manage the system we monitor node statistics (CPU, memory, disk IO, etc.) and alert on them at various levels for manual review. For example, we send an alert at 80% of CPU utilization. In all of our networks this is sufficient to allow analysis and response by our operations team; we have similar alerts for memory and disk IO based on the known limitations of the infrastructure, at around 75% of the hard limit.

In the end, the only way to operate the system well is to understand how your own call model is affected by different events. In most models, modest resource levels (2 CPU by 4 GB RAM) are fine. As you load traffic into the system, monitoring these fundamental resources can give you a dataset for capacity engineering and operations.
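
For a Kubernetes deployment, a minimal way to start collecting that dataset is to watch the router and controller pods directly. This sketch assumes metrics-server is installed and that the components run in a namespace called ziti; adjust the names to your own install:

```
# Per-container CPU and memory for the controller and router pods (needs metrics-server).
kubectl top pod -n ziti --containers

# Print each pod's resource requests/limits so observed usage can be compared against
# thresholds like the ~80% CPU and ~75% memory/disk levels mentioned above.
kubectl get pod -n ziti \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'
```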

Thanks @mike.gorman, that's insightful.

@mike.gorman any ideas on load testing? What helped you make the right decisions about how many resources to allocate to a particular router or controller?

It's incredibly difficult to provide sizing guidelines, and network-scale testing is hard to model accurately in general.

Imo, the best approach is to first define your expected traffic scenario, then emulate that scenario as best you can with cloud-based testing, starting with exceptionally modest hardware and scaling it up as necessary until adding resources no longer improves performance. At that point you need to dig in and figure out where the bottleneck is and why it's happening.

Even the best emulated testing will produce surprises you never expected once deployed. Sometimes that leads to refined testing, and sometimes just to lessons learned, because emulating a given scenario isn't always easy to do.

I personally believe you should model your traffic as closely to your expected traffic patterns as possible and, when possible, use the same tools, clients, and server software you plan to deploy.

There's no single answer here, and every solution is unique. This testing is also incredibly time-intensive, as accurate testing takes a lot of time and can be difficult to get right, often leading to multiple iterations.

I went back and looked at some old modeling data I had from an earlier project. I collected a large sample of edge router packets per second and CPU load from existing networks. That data shows a directly proportional relationship at approximately 100 Kpps per CPU. The systems were all cloud-based nodes, probably with 2.4 GHz processors, though I didn't record that. That should give you a sense of scale. There is a fair distribution around that line; however, the estimate seems conservative for larger loads, since the sample is weighted heavily toward the lower packet counts, where the base processing of the application shows some load even with very few packets forwarded.

That said, I would target 80% utilization at half that number (50 Kpps per CPU) as a start, and then watch your own numbers to get a better idea of what your network model does. I would not go below 2 GB RAM except for testing. The base memory load of the runtime is a significant portion of 1 GB, so anything less leaves little room for events, and like any software, out of memory is a very bad place to be.
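
To make the arithmetic concrete, here is a minimal sizing sketch. The 150 Kpps peak is purely hypothetical; the 50 Kpps-per-CPU planning figure is the derated estimate above:

```
# Hypothetical peak packet rate for one edge router; substitute your own measurement.
expected_pps=150000
# Derated planning figure: ~100 Kpps/CPU observed, planned at 50 Kpps/CPU.
pps_per_cpu=50000
# Ceiling division: CPUs needed to stay at or under the planning figure.
echo $(( (expected_pps + pps_per_cpu - 1) / pps_per_cpu ))   # prints 3; round up (e.g. 4 vCPU) for headroom
```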

As far as controllers go, we haven't been as successful at modeling them, since there are many more variables to consider. That said, of the few hundred networks we operate, the majority of in-use networks run on nodes of no more than 4 CPU / 8 GB RAM. That includes one network with a few hundred identities and ERs that averages over 5M call events per day. So 2 CPU / 4 GB RAM is usually OK for smaller networks, 4 CPU / 8 GB RAM if you intend to have a large network, and beyond that, monitor for capacity.

I hope this all helps.
