Prevention of cascading failures during route scheduling

Hi, for a large site with multiple edge routers to choose from (say 3), are there any controls in place in OpenZiti to prevent cascading failures during route scheduling?

As an example, I'm thinking of a scenario where, say, edge router 1 fails, and all its clients move to edge router 2, which buckles under load from its own clients and new clients from (the now failed) router 1. At this point, all these clients (from router 1 & 2) are going to move to router 3 and bring it down as well.

How does that scenario play out? Is there any protection against it built into OpenZiti (e.g. exponential backoff with jitter, load shedding at routers, random selection of edge routers at clients, etc.)? What can I do as a network admin/designer/app developer to protect against such cascading failures?

Hi @ehsanj , that's an interesting question. SDKs will use whichever edge router currently has the lowest latency. On the terminating side of the circuit, we can load balance using circuit counts to try to avoid overloading any given terminator. On the ingress side, the SDK chooses the initiating edge router. The SDK doesn't have the global knowledge the controller has, so it can't make as good a choice.
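To illustrate the terminating-side balancing, here's a minimal sketch of least-circuits selection. The `Terminator` type and field names are illustrative stand-ins, not OpenZiti's actual API:

```go
package main

import "fmt"

// Terminator is a simplified stand-in for a fabric terminator;
// the type and field names are illustrative, not OpenZiti's real types.
type Terminator struct {
	ID       string
	Circuits int // number of circuits currently routed to this terminator
}

// leastLoaded picks the terminator with the fewest active circuits,
// steering new circuits away from already-busy terminators.
func leastLoaded(terminators []Terminator) *Terminator {
	if len(terminators) == 0 {
		return nil
	}
	best := &terminators[0]
	for i := 1; i < len(terminators); i++ {
		if terminators[i].Circuits < best.Circuits {
			best = &terminators[i]
		}
	}
	return best
}

func main() {
	ts := []Terminator{{"t1", 12}, {"t2", 3}, {"t3", 7}}
	fmt.Println(leastLoaded(ts).ID) // prints "t2"
}
```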

We have discussed various ways this could be improved. The most obvious, but also most difficult, is extending the fabric to the SDK. In the near term, there are some quick fixes we could make. Instead of always picking the lowest latency, we could set a threshold and randomly pick from ERs that are within X of the lowest latency, or do weighted random selection from the lowest N.

I don't know that an ER would fall over in the conditions you describe. The SDK would already have a connection to each of the ERs it has access to, so there wouldn't be any new connections being established. There would be some extra memory use from new circuits being established, so you could hit a memory threshold and enter a GC spiral. Routers generally aren't CPU-bound. If the outgoing links were unable to sustain the throughput, flow control would kick in and throttle incoming traffic. So there are some ways in which the router could be downed or impaired.

I think in practice this is mitigated by monitoring traffic and using ER policies to make sure that there's sufficient backup capacity on grouped edge routers.

I'm going to see if one of our ops folks wants to comment on whether we've seen scenarios like this and whether they have anything to add or correct.

I'm not sure this was a very satisfying answer. I will say there are some pain points related to SDK edge router selection that we've heard from users, so I expect we'll be tackling this in some form once HA is released.


Edit: Wanted to mention we do use exponential backoff with jitter in several places, just not here. We use it more for protecting the controller against stampedes, for example when establishing terminators, api-sessions, and edge sessions. I think, as you noted, that smart/random selection is probably more appropriate here than backoff. We have also introduced various forms of rate limiting, again to protect the controller. I think because edge routers can be scaled, we haven't generally hit the same kinds of issues. We are going to want those strategies for the router as well, in the long run, to allow for more efficient networks.


Beyond the software design that Paul notes, there are many options in the design and operation of the network. We have deployed HA pairs at ingress and egress, leveraging VRRP as well as load balancing/sharing in both. The fabric is a more interesting case, depending on the geographic design and traffic patterns. In all cases, proper monitoring and capacity growth, including with automation, can prevent the kind of failures you are talking about.

With the software-defined nature of the network, and virtual machine operations either on premises or in the cloud, it is simple to replace or upsize routers as resource constraints arise, or to add routers to the mesh. A new router coming into the network will be operational in a few minutes and available for routing circuits. Routers under duress will have their impairment reflected in the latency measurements, load cost, and other metrics that are immediately taken into account by the controller and used to spread traffic appropriately. This can be handled by threshold monitoring and scheduled growth activities, or by operational response, manual or automated.

In the end, like all networks, success is a mixture of good infrastructure (which for OpenZiti means the software, the nodes that run it, and the network connectivity), good design engineering, and strong monitoring and operations. The more critical the network operation, the more it takes to operate at the level you demand, but OpenZiti has an amazing array of options and the best reporting and metrics in the space, and can enable whatever you need for operations and availability no matter your network model.
