HA - stuck without a leader

Giving HA a try. I must have done something wrong here. Not exactly sure if it's on the ctrl2 side or if it happened while adding ctrl2 to the cluster from ctrl1. ctrl2 is online as an uninitialized node. Appreciative of any assistance.

root@ctrl1:/home/testenv# systemctl show -p MainPID --value ziti-controller.service | xargs -rIPID sudo nsenter --target PID --mount -- ziti agent cluster list
╭───────┬──────────────────────────┬───────┬────────┬─────────────────┬───────────╮
│ ID    │ ADDRESS                  │ VOTER │ LEADER │ VERSION         │ CONNECTED │
├───────┼──────────────────────────┼───────┼────────┼─────────────────┼───────────┤
│ ctrl1 │ tls:ctrl1.a.internal:443 │ true  │ false  │ v1.4.3          │ true      │
│ ctrl2 │ ctrl2.a.internal:443     │ true  │ false  │ <not connected> │ false     │
╰───────┴──────────────────────────┴───────┴────────┴─────────────────┴───────────╯
root@ctrl1:/home/testenv# systemctl show -p MainPID --value ziti-controller.service | xargs -rIPID sudo nsenter --target PID --mount -- ziti agent cluster remove ctrl1.a.internal:443
error: CLUSTER_NO_LEADER: Cluster has no leader, unable to make model updates.

root@ctrl1:/home/testenv# ziti ops cluster remove ctrl2
Error: [POST /cluster/remove-member][503] clusterMemberRemoveServiceUnavailable  &{Error:0xc000d7de00 Meta:0xc00027d940}

root@ctrl1:/home/testenv# ziti ops cluster transfer-leadership ctrl1
Error: [POST /cluster/transfer-leadership][503] clusterTransferLeadershipServiceUnavailable  &{Error:0xc000e2c000 Meta:0xc00027b940}

I can see what went wrong: the address for the second node is incorrect. It should have a tls: prefix. I opened an issue to fix this: "Validate node address before adding to cluster", issue #2922 in openziti/ziti on GitHub.

If you add a node using ziti agent cluster add <addr>, you won't encounter this, as it will reach out to the node before adding it, to get the id. This will only work if the address is valid, so any mistakes will be caught before the node is added.
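Until that validation exists, a quick pre-check before adding a node can catch the missing prefix. This is only a minimal sketch: valid_cluster_addr is a hypothetical helper, not part of the ziti CLI, and the pattern assumes a plain tls:host:port address (it does not handle IPv6 literals or other transports).

```shell
# Hypothetical helper (not a ziti command): succeeds only if the address
# looks like tls:<host>:<port>, the form shown for ctrl1 in the cluster list.
valid_cluster_addr() {
  printf '%s\n' "$1" | grep -Eq '^tls:[A-Za-z0-9.-]+:[0-9]+$'
}

# The address that ended up in the cluster is rejected by the check.
valid_cluster_addr "ctrl2.a.internal:443" || echo "rejected: missing tls: prefix"

# The corrected address passes.
valid_cluster_addr "tls:ctrl2.a.internal:443" && echo "accepted"
```

Running the check first is mostly useful when passing --id by hand, since that path skips the connection test described above.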

To recover from this situation I'd take the following steps:

  1. Make a db snapshot: ziti fabric db snapshot.
  2. Shut ctrl1 down.
  3. Copy the db snapshot to a safe location.
  4. Make a backup of the cluster data directory.
  5. Delete the cluster data directory. This will wipe your raft journal and database.
  6. Update the controller config db: parameter to point to the snapshot database.
  7. Start the controller. It will initialize using the snapshot db, but ctrl2 won't be a member.
  8. Re-add ctrl2, this time not specifying --id, to make sure you're providing the correct address.
    a. Something like: ziti agent cluster add tls:ctrl2.a.internal:443
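As a rough sketch, the steps above could be laid out as a dry-run script that only prints what it would do. All paths are hypothetical placeholders to be replaced with the values from your own controller config, and the service name is the one from the output earlier in the thread. On an install like the one above, the ziti agent command may also need to be run through the same nsenter wrapper.

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps. By default it prints each command
# instead of running it; set DRY_RUN=0 only after checking every path.
# All paths are hypothetical placeholders, not Ziti defaults.
SNAPSHOT=/path/to/snapshot.db       # where the db snapshot was written (assumption)
DATA_DIR=/path/to/cluster-data      # cluster data directory from the controller config (assumption)
BACKUP_DIR=/path/to/backup          # safe location for the copies (assumption)
SERVICE=ziti-controller.service     # unit name as seen in the thread

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

recover() {
  run ziti fabric db snapshot                           # step 1
  run systemctl stop "$SERVICE"                         # step 2
  run cp "$SNAPSHOT" "$BACKUP_DIR/"                     # step 3
  run cp -a "$DATA_DIR" "$BACKUP_DIR/cluster-data.bak"  # step 4
  run rm -rf "$DATA_DIR"                                # step 5: wipes raft journal and db
  # Step 6 is a manual edit, so it is only printed as a reminder.
  echo "# manual: point the db: parameter in the controller config at the snapshot copy"
  run systemctl start "$SERVICE"                        # step 7
  run ziti agent cluster add tls:ctrl2.a.internal:443   # step 8, no --id
}

recover
```

Keeping the destructive steps behind a dry run makes it easy to review the exact commands before anything is deleted.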

Hope that's helpful.
Paul

Thanks :slight_smile: That was it and the steps worked perfectly! Turns out I had to adjust my cert configs in conf.yml too; the HA example configs in git helped me sort that out. Very neat, and while setup may seem a little daunting at first, things do start to "make sense".

I know HA is new, so no expectations, but is Windows client + Ext JWT auth compatible with it? I seem to be seeing some sort of success in the initial auth, with my test service listed, but I'm not able to connect. On my edge router I am seeing "invalid client certificate for api session" and "failure accepting channel edge with underlay".

Not sure if I fixed my PKI or if the latest Ziti updates helped, but I thought I'd mention that I was able to get ext jwt auth working with HA and am no longer seeing the invalid cert issues from my last comment :slight_smile: Though I'm hoping the visualizer feature in ZAC isn't compatible with HA, because it seems to be showing more errored links than one would hope for.