HA Controller Disconnected Functionality

I've been playing with the HA controller and it does everything I need it to do. Just out of curiosity, though: if a controller were to become disconnected, could I make a policy change on it? I did try, and it failed because the controller was a non-voting member and couldn't become a leader. But if I were to allow that controller to become a leader, could changes be made outside of the cluster and then brought back in once the network is restored?

If a controller is disconnected, then it can't become a leader. Without a quorum of voting members, it can't get elected.

If a cluster is partitioned, then the partition with the quorum should be able to elect a leader, and changes can be made on that side of the partition.
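To make the quorum math concrete, here's a small Go sketch (plain standard-library code, not anything from the controller) of the usual Raft majority rule:

```go
package main

import "fmt"

// quorum is the minimum number of voting members that must agree before a
// leader can be elected or a change committed: a strict majority.
func quorum(votingMembers int) int {
	return votingMembers/2 + 1
}

// canElectLeader reports whether a partition that can reach `reachable`
// voting members (out of `total` voters) is still able to elect a leader.
func canElectLeader(reachable, total int) bool {
	return reachable >= quorum(total)
}

func main() {
	total := 3 // e.g. a three-controller cluster

	fmt.Println(canElectLeader(1, total)) // false: a lone disconnected controller
	fmt.Println(canElectLeader(2, total)) // true: the majority side of a partition
}
```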

If you had a disconnected node, you could remove it from the cluster and initialize it so that it's a standalone cluster of size 1. You could then make changes on this standalone node. If you tried to rejoin this standalone node to the original cluster, it would fail, because you're not allowed to join two clusters together.
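As a rough illustration of what "a standalone cluster of size 1" means at the Raft level, here's a minimal sketch using hashicorp/raft. This isn't the controller's actual bootstrap code, just a single node bootstrapping itself as the only voter, so its quorum is 1 and it can elect itself leader:

```go
package main

import (
	"fmt"
	"io"
	"time"

	"github.com/hashicorp/raft"
)

// noopFSM is a stand-in state machine; a real controller would apply model
// changes (services, identities, policies) here.
type noopFSM struct{}

func (noopFSM) Apply(*raft.Log) interface{}         { return nil }
func (noopFSM) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }
func (noopFSM) Restore(io.ReadCloser) error         { return nil }

func main() {
	conf := raft.DefaultConfig()
	conf.LocalID = raft.ServerID("standalone-1")

	logs := raft.NewInmemStore()
	stable := raft.NewInmemStore()
	snaps := raft.NewInmemSnapshotStore()
	addr, transport := raft.NewInmemTransport("")

	r, err := raft.NewRaft(conf, noopFSM{}, logs, stable, snaps, transport)
	if err != nil {
		panic(err)
	}

	// Bootstrap a cluster whose only voter is this node. With one voter the
	// quorum is 1, so it can elect itself leader and accept changes.
	err = r.BootstrapCluster(raft.Configuration{
		Servers: []raft.Server{{ID: conf.LocalID, Address: addr}},
	}).Error()
	if err != nil {
		panic(err)
	}

	time.Sleep(3 * time.Second)
	fmt.Println("state:", r.State()) // Leader
}
```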

If you somehow managed to join two clusters with different journals, they would not synchronize. New entries would be distributed across the cluster, but the old journal entries would differ. So index 10 on node 1 might be adding a service, while index 10 on node 2 might be updating an identity. This is why it's a bad idea to try to join clusters together :slight_smile:
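A toy example of that divergence, just to illustrate the point (simplified entries, not the real journal format):

```go
package main

import "fmt"

// entry is a simplified journal entry. Real Raft entries also carry a term,
// which is how disagreement at the same index is actually detected.
type entry struct {
	index int
	op    string
}

// firstDivergence returns the first index at which two journals disagree.
func firstDivergence(a, b []entry) (int, bool) {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i].op != b[i].op {
			return a[i].index, true
		}
	}
	return 0, false
}

func main() {
	node1 := []entry{{9, "create identity laptop"}, {10, "create service s1"}}
	node2 := []entry{{9, "create identity laptop"}, {10, "update identity laptop"}}

	if idx, ok := firstDivergence(node1, node2); ok {
		fmt.Printf("journals disagree at index %d; there's no safe way to merge them\n", idx)
	}
}
```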

We've discussed ways in which we might be able to support networks where some controllers are intentionally allowed to be disconnected for long periods, specifically how to allow model changes on disconnected controllers. We'd need to settle on a use case to optimize for, to minimize complexity. There are a lot of ways in which we could make the overall software worse by trying to support a very niche use case.

If you've got thoughts on why you'd want to allow this and how you see conflict resolution working, I'd be curious to hear them.

Without knowing the use case, my initial thoughts are:

  1. Provide a convenience method to allow a node to be restarted in standalone mode.
  2. Provide a convenience method to allow rejoining a standalone node to a cluster, wiping out the local raft/db and re-syncing from the cluster (roughly the flow sketched after this list).
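To give a feel for what option 2 would involve, here's a hypothetical sketch. The data directory layout and the joinCluster call are assumptions for illustration, not the controller's actual API:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// rejoinFresh sketches option 2: discard whatever the standalone node did
// while disconnected, then join the original cluster and let Raft
// replication re-sync it from the leader's snapshot and log.
func rejoinFresh(dataDir string) error {
	// Wipe the local raft log and model database so no diverged state survives.
	for _, sub := range []string{"raft", "db"} {
		if err := os.RemoveAll(filepath.Join(dataDir, sub)); err != nil {
			return fmt.Errorf("wiping %s: %w", sub, err)
		}
	}

	// joinCluster stands in for whatever "add this node back to the cluster"
	// mechanism ends up being exposed.
	return joinCluster(dataDir)
}

func joinCluster(dataDir string) error {
	fmt.Println("rejoining cluster with empty state in", dataDir)
	return nil
}

func main() {
	if err := rejoinFresh("./controller-data"); err != nil {
		fmt.Println("rejoin failed:", err)
	}
}
```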

It's not particularly seamless, but it avoids a lot of complexity compared to things like:

  1. storing changes made while disconnected, to be re-applied once reconnected (roughly the shape sketched after this list)
  2. having an in-memory shim that provides the appearance of the changes, and which can be removed once re-synced.
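For contrast, option 1 above would look something like this (hypothetical types, just to show where the complexity lives):

```go
package main

import (
	"fmt"
	"time"
)

// pendingChange is a hypothetical record of a model change made while
// disconnected, queued for replay once the cluster is reachable again.
type pendingChange struct {
	madeAt time.Time
	op     string
}

// replay applies queued changes against the cluster after reconnecting.
// The loop is trivial; the hard part is what to do when a change conflicts
// with something the cluster did in the meantime.
func replay(queue []pendingChange, apply func(pendingChange) error) (failed []pendingChange) {
	for _, c := range queue {
		if err := apply(c); err != nil {
			// Conflict resolution policy goes here: drop, retry, or ask a human.
			failed = append(failed, c)
		}
	}
	return failed
}

func main() {
	queue := []pendingChange{
		{time.Now(), "grant identity edge-5 dial access to service s1"},
	}
	failed := replay(queue, func(c pendingChange) error {
		fmt.Println("replaying:", c.op)
		return nil
	})
	fmt.Println("unresolved conflicts:", len(failed))
}
```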

Paul

I definitely see why it would be a bad idea to allow changes; the network could become a distributed mess. The use case is very niche. Essentially, we have a cluster of 3 controllers in a stable location; these are our master controllers, where policy/identities will be created. The edge nodes will be on the move, without always the best network connection, and they still need to be able to work using the policy that already exists, which seems to be the case. It would be rare that changes would have to be made, but in an emergency I could see it being beneficial to easily switch back and forth and wipe the database upon re-entry, because who knows what was done with it while disconnected.