Hi @ehsanj, that's an interesting question. SDKs will use whichever edge router currently has the lowest latency. On the terminating side of the circuit, we can load balance, using the number of circuits to try to avoid overloading any given terminator. On the ingress side, the SDK chooses the initiating edge router. Unlike the controller, the SDK doesn't have global knowledge, so it can't make as good a choice.
We have discussed various ways this could be improved. The most obvious, but also most difficult, is extending the fabric to the SDK. In the near term, there are some quick fixes we could make. Instead of always picking the lowest latency, we could set a threshold and randomly pick from ERs that are within X of the lowest latency, or do weighted random selection among the N lowest-latency ERs.
I don't know that an ER would fall over in the conditions you describe. The SDK would already have a connection to each of the ERs it has access to, so there wouldn't be any new connections being established. There would be some extra memory from new circuits being established, so you could hit a memory threshold and enter a GC spiral. Routers generally aren't CPU-bound. If the outgoing links were unable to sustain the throughput, the flow control would kick in and incoming traffic would be throttled. So there are some ways in which the router could be downed or impaired.
I think in practice this is mitigated by monitoring traffic and using ER policies to make sure that there's sufficient backup capacity on grouped edge routers.
I'm going to see if one of our ops folks wants to comment on whether we've seen scenarios like this and whether they have anything to add or correct.
I'm not sure this was a very satisfying answer. I will say there are some pain points related to SDK edge router selection that we've heard from users, so I expect we'll be tackling this in some form once HA is released.
Cheers,
Paul
Edit: Wanted to mention we do use exponential backoff with jitter in several places, just not here. We use it more for protecting the controller against stampedes, for example when establishing terminators, api-sessions and edge sessions. I think, as you noted, that smart/random selection is probably more appropriate here than backoff. We have also introduced various forms of rate limiting, again to protect the controller. I think because edge routers can be scaled, we haven't generally hit the same kinds of issues. We are going to want those strategies for the router as well, in the long run, to allow for more efficient networks.