Hi @jnsfndr
I've got to add this to our HA documentation, because it's not there in any consolidated form yet.
So:
What still works
Management API Reads
Controller C can still serve REST API reads (list services, get identities, check policies, etc.) from its local data store. The data may be slightly stale (it reflects the last Raft index C received before quorum was lost), but all GET/LIST operations work fine since they read directly from the local database without touching Raft.
Sessions
API sessions (authentication): New clients can authenticate against C. API sessions do not require Raft consensus. When using OIDC (which is required for HA failover), API sessions are issued as signed JWTs entirely in memory; no database write is involved at all. Even on the legacy authentication path, the session is written only to the controller's local database, which doesn't go through Raft.
Service sessions have the same characteristics as API sessions. Posture check enforcement is now in the router, so that should also be unaffected.
Circuit Creation
Circuit creation does not require a Raft commit. Circuits are entirely in-memory objects on the controller that creates them. The flow consists entirely of local reads and router communication, with no Raft involvement. If a client has (or can create) a valid service session, it can create new circuits through C even without quorum.
Existing Circuits
Circuits are owned by the controller that created them. Once the route messages have been sent and the routers have their forwarding state, the data plane operates independently of the controller. This has important implications:
Circuits owned by C (the surviving controller):
- Continue to work normally
- Can be rerouted if a link or router in the path goes down, because C has the circuit context and can compute an alternate path
Circuits owned by one of the two controllers that went down:
- Continue to work because routers hold the forwarding state and keep moving data independently of the controller
- Get torn down as usual when the circuit completes (client closes the connection, etc.)
- Cannot be rerouted. If a link or router in the path fails, the owning controller isn't around to compute and install an alternate path, so the circuit breaks
- Even if the owning controller comes back up, the in-memory circuit context is lost, so it still can't reroute those circuits
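To make the ownership rules above concrete, here's a rough sketch that models them. This is purely illustrative and is not controller code: `Circuit`, `can_reroute`, and the `restarted` set are hypothetical names I made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Circuit:
    # Hypothetical model of a circuit, for illustration only.
    circuit_id: str
    owner: str  # name of the controller that created the circuit


def can_reroute(circuit, live_controllers, restarted=frozenset()):
    """Return True if the circuit could still be rerouted.

    A circuit can be rerouted only while its owning controller is up and
    has not restarted since creating it, because the circuit context
    lives only in that controller's memory.
    """
    return circuit.owner in live_controllers and circuit.owner not in restarted


if __name__ == "__main__":
    owned_by_c = Circuit("circuit-1", owner="C")
    owned_by_a = Circuit("circuit-2", owner="A")

    # Only C is up: C's circuits can be rerouted, A's cannot.
    print(can_reroute(owned_by_c, {"C"}))
    print(can_reroute(owned_by_a, {"C"}))

    # A came back, but its in-memory circuit context is gone.
    print(can_reroute(owned_by_a, {"A", "C"}, restarted={"A"}))
```

Note that in every case the circuit keeps carrying data as long as the path stays healthy; the function only answers whether a broken path could be repaired.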
What doesn't work
Model Mutations
The things that require Raft are durable model mutations: creating/updating/deleting services, identities, routers, policies, edge router policies, terminators, and similar configuration entities. On C without quorum, any REST API call that tries to write to these will receive an HTTP 503 with error code CLUSTER_NO_LEADER. The error comes back promptly, so clients won't hang.
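If you have automation that writes to the management API, it may be worth detecting this case explicitly so you can back off and retry once quorum is restored. Here's a minimal sketch of that check. The status code and error code come from the behavior described above, but the JSON shape (an `error` object with a `code` field) is an assumption on my part for the example, and `is_no_leader` is a hypothetical helper, so check it against what your controller actually returns.

```python
import json

NO_LEADER_CODE = "CLUSTER_NO_LEADER"


def is_no_leader(status, body):
    """Return True if a response looks like a 'no quorum' write rejection.

    Assumes (not confirmed here) that the error body is JSON shaped like
    {"error": {"code": "CLUSTER_NO_LEADER", ...}}.
    """
    if status != 503:
        return False
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # Missing or non-JSON body: treat as some other kind of 503.
        return False
    if not isinstance(payload, dict):
        return False
    error = payload.get("error")
    return isinstance(error, dict) and error.get("code") == NO_LEADER_CODE


if __name__ == "__main__":
    rejected = '{"error": {"code": "CLUSTER_NO_LEADER"}}'
    print(is_no_leader(503, rejected))
    print(is_no_leader(503, "Service Unavailable"))
```

Since reads keep working on C, a caller could use a check like this to retry only the write and carry on with GET/LIST calls in the meantime.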
Biggest Concern
The main thing that could impact operations is that new terminators can't be created. If a hosting router or SDK goes down, the terminator may not come back. In some cases it may be OK because we cache the terminator id, in hopes of avoiding an unnecessary write for transient failures, but this doesn't cover all cases.
Hope that's helpful,
Paul