2 Controller HA Cluster

Hi,
We were able to successfully start an OpenZiti HA cluster of 2 controllers.
Two controllers are currently enough for our project, although we read that our setup will need 3 controllers for voting and Raft purposes.
We thought about having a tiny 3rd controller, but we don't want it to become a leader;
we would like it to be a voter only.
This "voter" would never act as the controller leader or serve any other purpose.
What is the best way of operating 2 controllers? And if there is no such option, how can we be sure that a third controller will not become a leader?
Thank you!

Hello, I had a couple of thoughts on this:

  1. Two controllers might be fine. If one fails, you won't be able to make model updates until the second server is recovered, but existing identities and services should continue to work. So it depends on what your uptime requirements are for being able to make administrative changes.
  2. There's no capability that allows for a voting member which can't be a leader. The best you could do is monitor the system, and if you see that the node you don't want to be leader has become leader, you can request that leadership be transferred to a different node.
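
The monitor-and-transfer idea in point 2 could be scripted roughly as below. This is only a sketch under stated assumptions: the function name `hand_off_leadership` and the node ids are made up for illustration, and it assumes that `ziti agent cluster transfer-leadership` accepts the id of the desired new leader as an argument. How you find the current leader is left open (for example by reading `ziti agent cluster list`, whose output layout is not assumed here).

```shell
# ZITI is the CLI to invoke; override it (e.g. ZITI=echo) for a dry run.
ZITI="${ZITI:-ziti}"

# hand_off_leadership CURRENT_LEADER UNWANTED_LEADER PREFERRED_LEADER
# If the unwanted node currently leads, ask the cluster to move leadership
# to the preferred node; otherwise do nothing.
hand_off_leadership() {
  if [ "$1" = "$2" ]; then
    # Assumption: transfer-leadership takes the target node id as an argument.
    "$ZITI" agent cluster transfer-leadership "$3"
  else
    echo "leader is $1, nothing to do"
  fi
}
```

Run periodically (cron, a systemd timer), something like `hand_off_leadership "$leader" ctrl3 ctrl1` would push leadership back to `ctrl1` whenever the small voter `ctrl3` ends up leading.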

Hope that's helpful.

Paul

Thank you Paul!
We noticed something interesting, though, in the replication of data.
Identities and routers are fully replicated.
We created a service on ctrl1 and it was replicated to ctrl2, but it is missing the dial policy,
which rendered the service non-functional when ctrl1 was stopped.
Any idea why this would happen?

If you run

ziti fabric inspect data-model-index

That will tell you the index of the distributed journal on each controller node.

ziti fabric inspect router-data-model-index

will tell you what index routers have synced to for the smaller set of data being distributed to them.

Can you see if they're in sync?

Please also check the logs for errors, and if you see anything that looks suspicious that you can share, please post it.
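
For the log check, a small filter like the following can help. It is a minimal sketch that assumes the `[elapsed]   LEVEL message` line layout of the controller log lines posted later in this thread; the function name `controller_errors` is made up for illustration.

```shell
# Print only ERROR-level lines from a controller log read on stdin.
# Assumes each line starts with a bracketed timestamp followed by the level.
controller_errors() {
  # "|| true" keeps the exit status at 0 when no ERROR lines are found.
  grep -E '^\[[^]]*\][[:space:]]+ERROR[[:space:]]' || true
}
```

Usage depends on where your controller logs go. With a systemd install it might look like `journalctl -u ziti-controller | controller_errors`, where the unit name is an assumption.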

Thank you,
Paul

Hi Paul,
Sorry for the delay; I had to rebuild everything from scratch in order to verify and get the exact logs.
This time I created the service on ctrl1, and the replicated service on ctrl2 is missing the bind policy.
The output of the fabric command looks good and matches on both controllers:
CTRL1

> ziti fabric inspect data-model-index
> Results: (2)
> ctrl1.a.com.data-model-index
> index: 47
> timeline: 7Q0QD3aNR
> 
> ctrl2.a.com.data-model-index
> index: 46
> timeline: M_Dzd3aNg

CTRL2

> ziti fabric inspect data-model-index
> Results: (2)
> ctrl2.a.com.data-model-index
> index: 46
> timeline: M_Dzd3aNg
> 
> ctrl1.a.com.data-model-index
> index: 47
> timeline: 7Q0QD3aNR

Here is the interesting log from controller 2:

[1261.534]    INFO ziti/controller/raft.(*BoltDbFsm).Apply: {index=[43]} apply log with type *command.CreateEntityCommand[*github.com/openziti/ziti/controller/model.Config]
[1261.565]    INFO ziti/controller/raft.(*BoltDbFsm).Apply: {index=[44]} apply log with type *command.CreateEntityCommand[*github.com/openziti/ziti/controller/model.Config]
[1261.596]    INFO ziti/controller/raft.(*BoltDbFsm).Apply: {index=[45]} apply log with type *command.CreateEntityCommand[*github.com/openziti/ziti/controller/model.EdgeService]
[1261.633]    INFO ziti/controller/raft.(*BoltDbFsm).Apply: {index=[46]} apply log with type *command.CreateEntityCommand[*github.com/openziti/ziti/controller/model.ServicePolicy]
[1261.713]    INFO ziti/controller/raft.(*BoltDbFsm).Apply: {index=[47]} apply log with type *command.CreateEntityCommand[*github.com/openziti/ziti/controller/model.ServicePolicy]
[1261.713]   ERROR ziti/controller/model.(*baseEntityManager[...]).createEntityInTx: {error=[the value '[-uYmdbkune]' for 'identityRoles' is invalid: no identities found with the given ids]} could not create servicePolicy in bolt storage

Hi Tikal,
There's something quite wrong with your setup.

The controllers should be on the same index and same timeline. The timeline identifier is set on init and whenever a snapshot is restored. It's a way to tell routers when they need to do a full pull of the router data model.
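
A quick way to check this condition automatically is sketched below. It assumes the inspect output contains `index: N` and `timeline: ID` lines as in the output posted above; the function name `data_model_in_sync` is made up for illustration.

```shell
# Reads `ziti fabric inspect data-model-index` output on stdin and reports
# whether every controller shows the same index and the same timeline.
data_model_in_sync() {
  awk '
    $1 == "index:"    { idx[$2] = 1 }   # collect distinct index values
    $1 == "timeline:" { tl[$2] = 1 }    # collect distinct timeline ids
    END {
      ni = 0; for (k in idx) ni++
      nt = 0; for (k in tl) nt++
      if (ni == 1 && nt == 1) { print "in sync"; exit 0 }
      printf "OUT OF SYNC: %d distinct indexes, %d distinct timelines\n", ni, nt
      exit 1
    }'
}
```

Used as `ziti fabric inspect data-model-index | data_model_in_sync`, it exits non-zero when the nodes disagree, which is what the output posted above (47 vs 46, two different timelines) would trigger. A brief index difference can also show up while a follower is still catching up, so a mismatch that persists is the one to worry about.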

Can you walk me through how you set your cluster up? Did you maybe init both nodes and then join them? Before 1.6.0 there was a way to bypass a cluster id check; it's possible you hit that condition.

Thank you,
Paul

Hi Paul,
We used the express install on both controllers, but we made a slight change in ziti-cli-functions.sh so that ctrl2 does not create the root CA certificate again, as that caused problems trusting the certificates at first.
After installing ctrl1 we edited the yaml file to enable HA and then started it.
Then, on the install of ctrl2, we asked it not to recreate the admin password.
Then we edited the yaml, enabled HA on it, and started it.
We made sure that ctrl1 was initialized by running:

> ziti agent cluster init admin $ZITI_PWD super_admin -i ctrl1.a.com

Then we added ctrl2 to the cluster:

> ziti agent cluster add -i ctrl1.a.com tls:ctrl2.a.com:8440

We also found that we had to do the same on ctrl2 for voting to be accepted after the controllers were stopped:
> ziti agent cluster add -i ctrl2.a.com tls:ctrl1.a.com:8440

By the way, we found the cause of the error: the edge router existed before HA was enabled and was also recreated on ctrl2, so it ended up with the same name but a different id on each controller, which is why the replicated service was missing the bind policy. Removing the router and then adding it again fixed this specific issue, but we want to be sure that the sync is really working in this setup.

I think what's happening is that you're taking two standalone instances, converting them to HA and then joining them, which will end up with a broken HA setup. If you had removed the -i flag from the adds, the join would have correctly failed.

Try clearing the state for ctrl2 by removing the database and raft files and then restarting it. It should come up in an uninitialized state. At that point, ctrl1 should reach back out to it and bring it up to the current index.
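
If you'd rather keep a copy of the old state than delete it outright, the cleanup could be sketched like this. The function name and every path in the usage note are placeholders; the real database and raft locations are whatever your ctrl2 controller config points at, and the controller should be stopped before anything is moved.

```shell
# set_aside_controller_state BACKUP_DIR PATH...
# Moves each existing path into BACKUP_DIR instead of deleting it,
# so the old state can be restored or inspected later.
set_aside_controller_state() {
  backup_dir="$1"; shift
  mkdir -p "$backup_dir" || return 1
  for path in "$@"; do
    if [ -e "$path" ]; then
      mv -- "$path" "$backup_dir"/ || return 1
      echo "moved $path"
    else
      echo "skipped $path (not found)"
    fi
  done
}
```

With ctrl2 stopped, a call such as `set_aside_controller_state /tmp/ctrl2-backup /path/to/ctrl.db /path/to/raft` (hypothetical paths) clears the state, after which the controller can be started again.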

I'm pretty sure that if you had tried your setup routine with 1.6, you would have gotten the correct error messages.

thank you,
Paul