HA Controller Disconnected Functionality

I've been playing with the HA controller and it does everything I need it to do. Just out of curiosity, though: if a controller were to become disconnected, could I make a policy change on it? I did try, and it failed because the controller was a non-voting member and couldn't become a leader. But if I were to allow that controller to become a leader, could changes be made outside of the cluster and then brought back in once the network is restored?

If a controller is disconnected, then it can't become a leader. Without a quorum of voting members, it can't get elected.

If a cluster is partitioned, then the partition with the quorum should be able to elect a leader, and changes can be made on that side of the partition.
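To make the quorum math concrete, here's a small Go sketch (plain standard-library code, not anything from the controller) of the usual Raft majority rule:

```go
package main

import "fmt"

// quorum is the minimum number of voting members that must agree before a
// leader can be elected or a change committed: a strict majority.
func quorum(votingMembers int) int {
	return votingMembers/2 + 1
}

// canElectLeader reports whether a partition that can reach `reachable`
// voting members (out of `total` voters) is still able to elect a leader.
func canElectLeader(reachable, total int) bool {
	return reachable >= quorum(total)
}

func main() {
	total := 3 // e.g. a three-controller cluster

	fmt.Println(canElectLeader(1, total)) // false: a lone disconnected controller
	fmt.Println(canElectLeader(2, total)) // true: the majority side of a partition
}
```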

If you had a disconnected node, you could remove it from the cluster and initialize it so that it's a standalone cluster of size 1. You could then make changes on this standalone node. If you tried to rejoin this standalone node to the original cluster, it would fail, because you're not allowed to join two clusters together.
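As a rough illustration of what "a standalone cluster of size 1" means at the Raft level, here's a minimal sketch using hashicorp/raft. This isn't the controller's actual bootstrap code, just a single node bootstrapping itself as the only voter, so its quorum is 1 and it can elect itself leader:

```go
package main

import (
	"fmt"
	"io"
	"time"

	"github.com/hashicorp/raft"
)

// noopFSM is a stand-in state machine; a real controller would apply model
// changes (services, identities, policies) here.
type noopFSM struct{}

func (noopFSM) Apply(*raft.Log) interface{}         { return nil }
func (noopFSM) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }
func (noopFSM) Restore(io.ReadCloser) error         { return nil }

func main() {
	conf := raft.DefaultConfig()
	conf.LocalID = raft.ServerID("standalone-1")

	logs := raft.NewInmemStore()
	stable := raft.NewInmemStore()
	snaps := raft.NewInmemSnapshotStore()
	addr, transport := raft.NewInmemTransport("")

	r, err := raft.NewRaft(conf, noopFSM{}, logs, stable, snaps, transport)
	if err != nil {
		panic(err)
	}

	// Bootstrap a cluster whose only voter is this node. With one voter the
	// quorum is 1, so it can elect itself leader and accept changes.
	err = r.BootstrapCluster(raft.Configuration{
		Servers: []raft.Server{{ID: conf.LocalID, Address: addr}},
	}).Error()
	if err != nil {
		panic(err)
	}

	time.Sleep(3 * time.Second)
	fmt.Println("state:", r.State()) // Leader
}
```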

If you somehow managed to join two clusters with different journals, they would not synchronize. New entries would be distributed across the cluster, but the old journal entries would differ. So index 10 on node 1 might be adding a service, while index 10 on node 2 might be updating an identity. This is why it's a bad idea to try to join clusters together :slight_smile:
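A toy example of that divergence, just to illustrate the point (simplified entries, not the real journal format):

```go
package main

import "fmt"

// entry is a simplified journal entry. Real Raft entries also carry a term,
// which is how disagreement at the same index is actually detected.
type entry struct {
	index int
	op    string
}

// firstDivergence returns the first index at which two journals disagree.
func firstDivergence(a, b []entry) (int, bool) {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i].op != b[i].op {
			return a[i].index, true
		}
	}
	return 0, false
}

func main() {
	node1 := []entry{{9, "create identity laptop"}, {10, "create service s1"}}
	node2 := []entry{{9, "create identity laptop"}, {10, "update identity laptop"}}

	if idx, ok := firstDivergence(node1, node2); ok {
		fmt.Printf("journals disagree at index %d; there's no safe way to merge them\n", idx)
	}
}
```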

We've discussed ways in which we might be able to support networks where some controllers are intentionally allowed to be disconnected for long periods, specifically how to allow model changes on disconnected controllers. We'd need to settle on a use case to optimize for, to minimize complexity. There are a lot of ways in which we could make the overall software worse by trying to support a very niche use case.

If you've got thoughts on why you'd want to allow this and how you see conflict resolution working, I'd be curious to hear them.

Without knowing the use case, my initial thoughts are:

  1. Provide a convenience method to allow a node to be restarted in standalone mode.
  2. Provide a convenience method to allow rejoining a standalone node to a cluster, wiping out the local raft/db and re-syncing from the cluster (roughly the flow sketched after this list).
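To give a feel for what option 2 would involve, here's a hypothetical sketch. The data directory layout and the joinCluster call are assumptions for illustration, not the controller's actual API:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// rejoinFresh sketches option 2: discard whatever the standalone node did
// while disconnected, then join the original cluster and let Raft
// replication re-sync it from the leader's snapshot and log.
func rejoinFresh(dataDir string) error {
	// Wipe the local raft log and model database so no diverged state survives.
	for _, sub := range []string{"raft", "db"} {
		if err := os.RemoveAll(filepath.Join(dataDir, sub)); err != nil {
			return fmt.Errorf("wiping %s: %w", sub, err)
		}
	}

	// joinCluster stands in for whatever "add this node back to the cluster"
	// mechanism ends up being exposed.
	return joinCluster(dataDir)
}

func joinCluster(dataDir string) error {
	fmt.Println("rejoining cluster with empty state in", dataDir)
	return nil
}

func main() {
	if err := rejoinFresh("./controller-data"); err != nil {
		fmt.Println("rejoin failed:", err)
	}
}
```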

It's not particularly seamless, but it avoids a lot of complexity compared to things like:

  1. storing changes made while disconnected, to be re-applied once reconnected (roughly the shape sketched after this list)
  2. having an in-memory shim that provides the appearance of the changes, and which can be removed once re-synced.
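For contrast, option 1 above would look something like this (hypothetical types, just to show where the complexity lives):

```go
package main

import (
	"fmt"
	"time"
)

// pendingChange is a hypothetical record of a model change made while
// disconnected, queued for replay once the cluster is reachable again.
type pendingChange struct {
	madeAt time.Time
	op     string
}

// replay applies queued changes against the cluster after reconnecting.
// The loop is trivial; the hard part is what to do when a change conflicts
// with something the cluster did in the meantime.
func replay(queue []pendingChange, apply func(pendingChange) error) (failed []pendingChange) {
	for _, c := range queue {
		if err := apply(c); err != nil {
			// Conflict resolution policy goes here: drop, retry, or ask a human.
			failed = append(failed, c)
		}
	}
	return failed
}

func main() {
	queue := []pendingChange{
		{time.Now(), "grant identity edge-5 dial access to service s1"},
	}
	failed := replay(queue, func(c pendingChange) error {
		fmt.Println("replaying:", c.op)
		return nil
	})
	fmt.Println("unresolved conflicts:", len(failed))
}
```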

Paul

I definitely see why it would be a bad idea to allow changes; the network could become a distributed mess. The use case is very niche. Essentially, we have a cluster of 3 controllers in a stable location; these are our master controllers, where policy/identities will be created. The edge nodes will be on the move, without always the best network connection, and they still need to be able to work using the policy that already exists, which seems to be the case. It would be rare that changes would have to be made, but in an emergency I could see it being beneficial to easily switch back and forth and wipe the database upon re-entry, because who knows what was done with it while disconnected.