HA Controller cluster deployment

Hi everyone,
I've been testing a couple of things in my lab, but I've run into a wall.

In this test I have:
2 x Controllers
1 x Load Balancer

What I want to achieve is an on-premises deployment with a load balancer in front of the cluster, reachable by cloud routers (multi-cloud).

In my head controller I have this:
raft:
  bootstrapMembers:
    - tls:xxx.xxx.xxx.xxx:8440
    - tls:xxx.xxx.xxx.xxx:8440
..........
ctrl:
  options:
    advertiseAddress: tls:xxx.xxx.xxx.xxx:8440
........
The error shown when trying to start the controller with this config:
FATAL fabric/controller/raft.(*Controller).validateCert: {error=[invalid controller certificate, no controller SPIFFE ID in cert]} controller cert must have Subject Alternative Name URI of form spiffe://<trust domain>/controller/<controller id>

I've used the quickstart from "Host it anywhere".

Which address should go into the advertiseAddress, the public one or the hostname?
Is it possible to use load balancers other than nginx?

Many thanks in advance

Hi @MoonCraver, welcome to the community and to OpenZiti!

I have no seat time with HA configurations yet, so sadly I can't help out. @plorenz or @andrew.martinez, I think one of you is going to need to help out here?

You should advertise the public address.
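
For example, assuming a public DNS name like ctrl1.example.com (a placeholder, not from your setup), that would look something like:

ctrl:
  options:
    advertiseAddress: tls:ctrl1.example.com:8440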

Hi @MoonCraver,
HA isn't feature complete yet. You should be able to get a cluster up and running; the model is distributed and distributed routing should be working. However, distributing sessions and posture check data is still in progress, so edge SDK connectivity will either not work at all, or only work sometimes. For now, I'd recommend starting off with a single controller.

If you want to experiment with what's working so far for HA, you'll need to generate your controller certs with a trust domain and SPIFFE IDs. There are some HA docs here. Take a look at the create-pki.sh script.

Your trust root should have a trust domain and your server cert needs a SPIFFE ID. That ID is how controllers identify each other.

# Create the trust root, a self-signed CA
ziti pki create ca --trust-domain ha.test --pki-root ./pki --ca-file ca --ca-name 'HA Example Trust Root'

# Create the controller 1 intermediate/signing cert
ziti pki create intermediate --pki-root ./pki --ca-name ca --intermediate-file ctrl1 --intermediate-name 'Controller One Signing Cert'

# Create the controller 1 server cert
ziti pki create server --pki-root ./pki --ca-name ctrl1 --dns localhost --ip 127.0.0.1 --server-name ctrl1 --spiffe-id 'controller/ctrl1'

Let me know if that helps or if I can clarify things some more.

Cheers,
Paul

Thanks guys!

I'll give it a try and ask for detailed help if I missed something.

It would be great to have shared sessions and posture checks working, but as a backup in case one of the servers fails, this is enough for me at the moment.

BR

Hi again!

I've been testing the setup you pointed me to in the docs.

There are some issues that are preventing the cluster from forming:

1. When deploying all the controllers, the instances can't communicate with each other. I also tried a custom YAML, but it was impossible to make connections between them with the SDK.

2. When the Raft module loads on the first instance, it isn't forced to become the leader, leaving orphaned cluster members. All of them become voters, but none of them assumes the leadership role.

3. When forming the cluster, the CLI forces new members to become voters by default; the "--voter false" flag fails.

Hope this helps to improve the project a little.

BR

Re 1: Can you expand a little on your test setup? It looked like you were setting up a two-node cluster. Are you using the config file to add the bootstrap members on one node or both? Or are you using ziti agent cluster add? Do you have raft: { minClusterSize: 2 } set in the config file? I'd like to be able to reproduce what you're seeing.
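
For reference, this is roughly the shape of the two-node bootstrap config I'd expect on the first node (the addresses are placeholders):

raft:
  minClusterSize: 2
  dataDir: ./data/ctrl1
  bootstrapMembers:
    - tls:ctrl1.example.com:8440
    - tls:ctrl2.example.com:8440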

Re 2: In Raft there's no dedicated leader. All voting members of the cluster are eligible to become the leader. There is a ziti agent cluster transfer-leadership command, but that is intended primarily for graceful upgrades, etc.
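
For completeness, a sketch of how that can be invoked (the member id argument and the -i instance flag follow the pattern of the other cluster commands below, so treat it as illustrative):

# Ask the cluster to hand leadership to a specific member
ziti agent cluster transfer-leadership ctrl2 -i ctrl1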

Re 3: Sounds like a bug; I'll investigate.

Thanks for sharing your test results. I'm not confident your setup will work for everything you want to do, but at a bare minimum we should at least be able to get the cluster up and running, and admin functionality should work.

Paul

Doing some testing, it looks like the ALPN work that went in with 0.30.0 (allowing consolidation to a single listening server port) has broken HA cluster setup. Because HA isn't feature complete yet, HA smoketests aren't running for releases yet. I'll let you know when a fix is available.

Update
I put in an issue to track this: Raft cluster connections not updated for ALPN · Issue #783 · openziti/fabric · GitHub. There's a PR up with a fix.

I also tried to reproduce the --voter=false issue, but it was working for me.

$ ziti agent cluster list -i ctrl1
╭───────┬────────────────────┬───────┬────────┬─────────┬───────────╮
│ ID    │ ADDRESS            │ VOTER │ LEADER │ VERSION │ CONNECTED │
├───────┼────────────────────┼───────┼────────┼─────────┼───────────┤
│ ctrl1 │ tls:localhost:6262 │ true  │ true   │ v0.0.0  │ true      │
╰───────┴────────────────────┴───────┴────────┴─────────┴───────────╯
$ ziti agent cluster add -i ctrl1 tls:127.0.0.1:6363 --voter=false
success, added ctrl2 at tls:localhost:6363 to cluster
$ ziti agent cluster list -i ctrl1
╭───────┬────────────────────┬───────┬────────┬─────────┬───────────╮
│ ID    │ ADDRESS            │ VOTER │ LEADER │ VERSION │ CONNECTED │
├───────┼────────────────────┼───────┼────────┼─────────┼───────────┤
│ ctrl1 │ tls:localhost:6262 │ true  │ true   │ v0.0.0  │ true      │
│ ctrl2 │ tls:localhost:6363 │ false │ false  │ v0.0.0  │ true      │
╰───────┴────────────────────┴───────┴────────┴─────────┴───────────╯

Let me know what I'm doing differently. We should hopefully have a release with the fix for the cluster connections out soon.

Cheers,
Paul

Hi @MoonCraver , a fix for the cluster connections issue has been released in v0.30.3. Let me know if you get a chance to try it, and if so, how it goes.

Cheers,
Paul

Morning,

Re 1:

raft:
  minClusterSize: 3
  dataDir: ./data/ctrl1
  bootstrapMembers:
    - tls:xxx.xxx.xxx.xxx:8446
    - tls:xxx.xxx.xxx.xxx:8447
    - tls:xxx.xxx.xxx.xxx:8448
  leaderLeaseTimeout: 2s
  electionTimeout: 5s
  commitTimeout: 1s
  heartbeatTimeout: 5s

Re 2 & 3: For this one I passed "false" as a separate argument ("--voter false") rather than as a value ("--voter=false"). That's probably where I got the command wrong.

Thanks for your work, guys! I'll keep testing throughout the day and share the results, along with the infrastructure used for it.

BR

Hello again!
I managed to deploy the controllers in HA without problems.

Some fixes need to be made in the docs to make it run like clockwork.

Here are my lab configurations.

I can also send you improvements to the manual to update the deployment and controller configurations.

After doing some performance and debugging tests, I will test it as part of a pool behind a software load balancer, with ZAC running on Docker and certificates (self-signed and/or public).

As a milestone: once router enrollment is done, the router is able to recognize the HA controllers and balance traffic according to load.


HARDWARE SPECS (VIRTUALIZED):
HOST:
Proxmox Virtual Environment - Debian Based
4 Cores - 32 GB RAM - 256 GB SSD

GUESTS:

Controllers:
3 x Ubuntu Server 22.04
2 vCores - 2 GB RAM - 20 GB SSD

Load Balancer:
1 x Zevenet CE
2 vCores - 2 GB RAM - 20 GB SSD

Edge Router:
1 x Ubuntu Server 22.04
2 vCores - 2 GB RAM - 20 GB SSD

ZAC:
1 x Ubuntu Server 22.04
2 vCores - 2 GB RAM - 20 GB SSD
With Docker

BR

Thanks, @MoonCraver, let us know how your testing goes. If you've got doc tweaks, I'll be happy to incorporate them. I still need to move the docs from ziti/doc/ha into the ziti-doc repo and do some reorganization, clean-up and expansion. As I mentioned earlier, I'll be surprised if things work for you if you're using SDK clients. If you're using non-edge-based ingress into the fabric, using something like xgress_proxy, that should work. If you're using tunneler-enabled routers, that may work; I haven't thought it through to see if it would be OK. Either way, I'm curious to hear how it goes.

Cheers,
Paul

Hi @plorenz ,

I would like to ask about the possibility of scaling out the OpenZiti controller.

As I understand it so far, HA means there will be a failover function to keep the service available.
But under heavy load, isn't it better to have a scalable (scale-out) deployment for the OpenZiti controller?

Thanks!

Best Regards,
Rick

Hi @rickwang7712

Let me run through the various kinds of availability and scale you can expect. I'll use HA to mean high availability, meaning we can withstand losing one or more servers. I'll use HS for horizontally scalable, meaning that the load can spread across multiple servers.

Services can be terminated on multiple routers. Based on how it's configured, this can be done to provide just HA or HA and HS. See the guide for more details.
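
As a rough sketch of how that looks from the CLI (the service name here is made up and the flag/strategy names are from memory, so double-check against the guide rather than treating this as authoritative):

# Create a service and choose how traffic is spread across its terminators
# (i.e. across the routers that host it)
ziti edge create service demo-service --terminator-strategy random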

If the controller is running in a cluster, then circuits can be routed on any controller so once a service is configured, routing can be load-balanced across the controllers and is HS. The only caveat here is that whichever controller routes a circuit also owns that circuit and other controllers do not know about the circuit and cannot take it over if needed. If that controller goes down, the circuit will continue to work, but re-routes won't work. Improving the routing model will be a focus of post-HA development.

For model updates (creating/updating/deleting services, identities, configs, policies, etc.), the controllers use Raft. So model updates are HA, not HS. This is because for model updates we prioritized consistency over availability. There are some use cases where this might not be the right trade-off. For those, we're thinking about what makes sense. Initial thoughts have been around multi-tenant routers, so that routers can be shared by independent controllers, or some version of controller federation. This is an area where we don't have a clear plan yet.

Let me know if I can clarify anything.

Cheers,
Paul
