HA controllers docker compose and going from single node to HA

Hello again!

Hopefully the last time mythering everyone on here :grin:

If I already have a single node controller up with docker compose, is it possible to add some more controllers and create a cluster (ideally using docker compose) on remote machines, or do I have to delete the existing single node and start again from scratch in HA mode?

I had a quick stab at creating new controllers, swapping out the certs for the existing controller's, then trying to manually run the ziti quickstart join command inside the new controller's container, but I wasn't really getting anywhere as I don't think I had the commands right.

Just wondering how it should normally be done.

Many thanks,

Jon.

Hello!

Here's the migration guide: Migrating Controllers | OpenZiti

An overview:

  1. save a database snapshot
  2. stop the standalone controller
  3. edit its config.yml to set a dataDir while preserving the db path (the database is migrated to dataDir on startup if both are configured)
  4. add a SPIFFE ID to the controller's identity certs; the ID is composed like spiffe://{{trust domain}}/controller/{{node name}}
  5. start the first controller
  6. optionally, grow the cluster by adding subsequent nodes with a signer cert from the same root. Each member's SPIFFE ID must have the same spiffe://{{trust domain}} part and a unique /controller/{{node name}} part
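
To make the SPIFFE ID layout from steps 4 and 6 concrete, here is a minimal shell sketch. The `spiffe_id` helper is hypothetical (it is not part of the ziti CLI), and `example.org`, `ctrl1`, and `ctrl2` are placeholder values chosen for illustration.

```shell
# Hypothetical helper (not part of the ziti CLI): compose a controller SPIFFE ID
# in the form spiffe://{{trust domain}}/controller/{{node name}}.
spiffe_id() {
  trust_domain="$1"
  node_name="$2"
  printf 'spiffe://%s/controller/%s\n' "$trust_domain" "$node_name"
}

# Every cluster member shares the trust domain and has a unique node name.
spiffe_id example.org ctrl1
spiffe_id example.org ctrl2
```

The only part that changes from one member to the next is the final node name segment.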

There are more details in the other articles about clustered mode, which live in the same section of the docs as the migration guide I linked.

Let us know if you get stuck.

Good luck.


Thanks will give it a go!

I'm getting somewhere now, but I'm also hitting an error with the certificates.

I followed the guide on creating the certs, but after adding a 2nd cluster member from the first cluster node, I got this error:

ziti-controller-1   | {"error":"error dialing peer tls:ziti-controller3.wizznet.co.uk:1280: tls: failed to verify certificate: x509: certificate signed by unknown authority","file":"github.com/hashicorp/raft@v1.7.3/raft.go","func":"github.com/hashicorp/raft.(*Raft).preElectSelf.(*Raft).preElectSelf.func1.func2","level":"error","msg":"failed to make requestVote RPC","target":{"Suffrage":0,"ID":"ziti-controller3","Address":"tls:ziti-controller3.wizznet.co.uk:1280"},"term":5,"time":"2025-06-01T17:00:27.034Z"}

Also, the cluster instantly implodes and I see that it no longer has a leader (no_leader), so I have to remove the raft directory and start again, as all cluster commands start failing.

I'm referencing the certs in the config like this:

First controller (initial controller) ziti-controller1

[root]# cat config.yml
v: 3

#trace:
#  path: "ziti-controller.wizznet.co.uk.trace"

#profile:
#  memory:
#    path: ctrl.memprof



db:                     "/ziti-controller/bbolt.db"
# uncomment and configure to enable HA
cluster:
  dataDir:         "/ziti-controller/raft"


identity:
  cert:        "pki/ziti-controller1/certs/client.cert"
  server_cert: "pki/ziti-controller1/certs/server.cert"
  key:         "pki/ziti-controller1/keys/server.key"
  ca:          "pki/ca/certs/ca.cert"

2nd controller (ziti-controller3) (sorry for the confusing numbering)

v: 3

#trace:
#  path: "ziti-controller.wizznet.co.uk.trace"

#profile:
#  memory:
#    path: ctrl.memprof



db:                     "/ziti-controller/bbolt.db"
# uncomment and configure to enable HA
cluster:
  dataDir:         "/ziti-controller/raft"


identity:
  cert:        "pki/ziti-controller3/certs/client.cert"
  server_cert: "pki/ziti-controller3/certs/server.cert"
  key:         "pki/ziti-controller3/keys/server.key"
  ca:          "pki/ca/certs/ca.cert"
  #alt_server_certs:
  #  - server_cert:  ""
  #    server_key:   ""

and the file structure

procedures I used to create certs:

# Create the trust root, a self-signed CA
ziti pki create ca --trust-domain wizznet.co.uk --pki-root ./pki --ca-file ca --ca-name 'Wizznet Root CA'

# Create the controller 1 intermediate/signing cert
ziti pki create intermediate --pki-root ./pki --ca-name ca --intermediate-file ziti-controller1 --intermediate-name 'Controller One Signing Cert'

# Create the controller 1 server cert
ziti pki create server --pki-root ./pki --ca-name ziti-controller1 --dns "localhost,ziti-controller1,ziti-controller1.wizznet.co.uk" --ip "127.0.0.1,::1,10.60.0.120" --server-name ziti-controller1 --spiffe-id 'controller/ziti-controller1'

# Create the controller 1 client cert
ziti pki create client --pki-root ./pki --ca-name ziti-controller1 --client-name ziti-controller1 --spiffe-id 'controller/ziti-controller1'

# Create the controller 2 intermediate/signing cert
ziti pki create intermediate --pki-root ./pki --ca-name ca --intermediate-file ziti-controller2 --intermediate-name 'Controller Two Signing Cert'

# Create the controller 2 server cert
ziti pki create server --pki-root ./pki --ca-name ziti-controller2 --dns "localhost,ziti-controller2,ziti-controller2.wizznet.co.uk" --ip "127.0.0.1,::1,10.60.0.174" --server-name ziti-controller2 --spiffe-id 'controller/ziti-controller2'

# Create the controller 2 client cert
ziti pki create client --pki-root ./pki --ca-name ziti-controller2 --client-name ziti-controller2 --spiffe-id 'controller/ziti-controller2'

# Create the controller 3 intermediate/signing cert
ziti pki create intermediate --pki-root ./pki --ca-name ca --intermediate-file ziti-controller3 --intermediate-name 'Controller Three Signing Cert'

# Create the controller 3 server cert
ziti pki create server --pki-root ./pki --ca-name ziti-controller3 --dns "localhost,ziti-controller3,ziti-controller3.wizznet.co.uk" --ip "127.0.0.1,::1,10.60.0.12" --server-name ziti-controller3 --spiffe-id 'controller/ziti-controller3'

# Create the controller 3 client cert
ziti pki create client --pki-root ./pki --ca-name ziti-controller3 --client-name ziti-controller3 --spiffe-id 'controller/ziti-controller3'
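
The per-controller commands above all follow one pattern, so here is a rough sketch of a helper that prints them for a given node for review before running anything. The `print_pki_commands` function is hypothetical and only echoes text; the flags are copied from the commands above. The intermediate name it prints is a simplified placeholder rather than the exact names used above.

```shell
# Hypothetical helper: print (not run) the three "ziti pki create" commands for
# one controller, following the pattern of the commands above.
# Usage: print_pki_commands NODE_NAME NODE_IP
print_pki_commands() {
  node="$1"
  ip="$2"
  domain="wizznet.co.uk"   # DNS domain used in this thread

  # Intermediate/signing cert issued by the shared root CA
  printf "ziti pki create intermediate --pki-root ./pki --ca-name ca --intermediate-file %s --intermediate-name '%s Signing Cert'\n" "$node" "$node"

  # Server cert with DNS/IP SANs and the controller SPIFFE ID
  printf "ziti pki create server --pki-root ./pki --ca-name %s --dns \"localhost,%s,%s.%s\" --ip \"127.0.0.1,::1,%s\" --server-name %s --spiffe-id 'controller/%s'\n" "$node" "$node" "$node" "$domain" "$ip" "$node" "$node"

  # Client cert with the same SPIFFE ID
  printf "ziti pki create client --pki-root ./pki --ca-name %s --client-name %s --spiffe-id 'controller/%s'\n" "$node" "$node" "$node"
}

print_pki_commands ziti-controller2 10.60.0.174
```

Printing instead of executing makes it easy to compare the output against the hand-written commands and spot a mismatched node name.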

Thanks!
Jon

Hi Jon,
I wonder if the cluster splintering and subsequent certificate issues could be caused by having a standalone controller database file path (the db property, e.g., bbolt.db) configured for the second controller you're trying to add. That's the only potential concern that leaped out at me.

If you migrated the first controller from standalone to clustered mode, then you can delete the db property from the first controller, and it's never needed for subsequent nodes. They'll require only the cluster.dataDir, and will read the Raft journal from the current cluster leader.
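
As a quick sanity check before starting a joining node, something like the following sketch could confirm that its config.yml has no active db entry and does set a dataDir. The `joining_node_config_ok` function is hypothetical (not an OpenZiti tool) and only does a simple text match, assuming the config is laid out like the ones posted above.

```shell
# Hypothetical check (not an OpenZiti tool): succeed only if the config file
# has no active top-level "db:" key and does contain an indented "dataDir:".
joining_node_config_ok() {
  config_file="$1"
  # An active top-level key starts in column one; commented lines start with '#'.
  if grep -Eq '^db:' "$config_file"; then
    echo "active db: entry found in $config_file" >&2
    return 1
  fi
  # dataDir is nested under "cluster:", so it is indented.
  grep -Eq '^[[:space:]]+dataDir:' "$config_file"
}
```

For example, `joining_node_config_ok /ziti-controller/config.yml && echo "looks OK"` prints the message only when the db line is absent or commented out.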


If that's not it, let's examine the third controller's server certificate, which failed validation. Your PKI commands looked correct: the third controller's identity leaf certificates are from the third intermediate signer you created from the common root CA, and they have the required subject and subject alternatives.

{
  "error": "error dialing peer tls:ziti-controller3.wizznet.co.uk:1280: tls: failed to verify certificate: x509: certificate signed by unknown authority",
  "file": "github.com/hashicorp/raft@v1.7.3/raft.go",
  "func": "github.com/hashicorp/raft.(*Raft).preElectSelf.(*Raft).preElectSelf.func1.func2",
  "level": "error",
  "msg": "failed to make requestVote RPC",
  "target": {
    "Suffrage": 0,
    "ID": "ziti-controller3",
    "Address": "tls:ziti-controller3.wizznet.co.uk:1280"
  },
  "term": 5,
  "time": "2025-06-01T17:00:27.034Z"
}

Yeah the first one was a single node to cluster migration.

I'll try again without the db (bbolt.db) setting when joining the next one...

When I start again, ziti1 starts up, migrates to HA, and runs.

Then I try ziti2 again, but with the db line commented out.

And get this error:

openziti-controller-ziti-controller-1 exited with code 123
openziti-controller-ziti-controller-1   | INFO: config file exists in /ziti-controller/config.yml
openziti-controller-ziti-controller-1   | realpath: missing operand
openziti-controller-ziti-controller-1   | Try 'realpath --help' for more information.
openziti-controller-ziti-controller-1   | WARN: set VERBOSE=1 or DEBUG=1 for more output
openziti-controller-ziti-controller-1   | WARN: see output in '/tmp/tmp.riljLnKZ8y'
openziti-controller-ziti-controller-1 exited with code 123
openziti-controller-ziti-controller-1   | INFO: config file exists in /ziti-controller/config.yml
openziti-controller-ziti-controller-1   | realpath: missing operand
openziti-controller-ziti-controller-1   | Try 'realpath --help' for more information.
openziti-controller-ziti-controller-1   | WARN: set VERBOSE=1 or DEBUG=1 for more output
openziti-controller-ziti-controller-1   | WARN: see output in '/tmp/tmp.VwRmhjtBhR'
openziti-controller-ziti-controller-1 exited with code 123

Oh, also, I've just realised that I left trustDomain commented out in the config file. Could that be a problem?

# cat config.yml
v: 3

#trace:
#  path: "ziti-controller.wizznet.co.uk.trace"

#profile:
#  memory:
#    path: ctrl.memprof



#db:                     "/ziti-controller/bbolt.db"
# uncomment and configure to enable HA
cluster:
  dataDir:         "/ziti-controller/raft"


identity:
  cert:        "pki/ziti-controller2/certs/client.chain.pem"
  server_cert: "pki/ziti-controller2/certs/server.chain.pem"
  key:         "pki/ziti-controller2/keys/server.key"
  ca:          "pki/ca/certs/ca.cert"
  #alt_server_certs:
  #  - server_cert:  ""
  #    server_key:   ""

# trust domains may be overridden by SPIFFE ID as URI SAN
#trustDomain: ziti.example.com

# additional trust domains allow for migrating to a new trust domain
#additionalTrustDomains: []

I recreated everything but still get the same "handshake failed, invalid signature by the client certificate" verification error.

error: CLUSTER_NO_LEADER: Cluster has no leader, unable to make model updates.
[ziggy@8923fc0427e0 ~]$ ziti agent cluster list
╭──────────────────┬─────────────────────────────────────────┬───────┬────────┬─────────────────┬───────────╮
│ ID               │ ADDRESS                                 │ VOTER │ LEADER │ VERSION         │ CONNECTED │
├──────────────────┼─────────────────────────────────────────┼───────┼────────┼─────────────────┼───────────┤
│ ziti-controller1 │ tls:ziti-controller1.wizznet.co.uk:1280 │ true  │ false  │ v1.5.4          │ true      │
│ ziti-controller2 │ tls:ziti-controller2.wizznet.co.uk:1280 │ true  │ false  │ <not connected> │ false     │
╰──────────────────┴─────────────────────────────────────────┴───────┴────────┴─────────────────┴───────────╯


The command I ran from the ziti1 CLI to add the ziti2 controller was:


ziti agent cluster add tls:ziti-controller2.wizznet.co.uk:1280 --id ziti-controller2

Swapped some certs around, must have been using the wrong ones, getting further now!

openziti-controller-ziti-controller-1 | {"error":"local cluster id 1024172a-877d-456e-88b3-9c3d19e3d994 doesn't match peer cluster id 670585c4-c545-4522-988f-037fbb26bc77","file":"github.com/openziti/channel/v3@v3.0.39/accept_dispatcher.go:83","func":"github.com/openziti/channel/v3.(*UnderlayDispatcher).Run","level":"error","msg":"error handling incoming connection, closing connection","time":"2025-06-02T21:29:54.170Z"}

OK sorted!

I had to make sure the 2nd node didn't have any of these set on its second start, after enabling HA:

#ZITI_BOOTSTRAP_PKI=true
#ZITI_BOOTSTRAP_CONFIG=true
#ZITI_BOOTSTRAP_DATABASE=true
#ZITI_AUTO_RENEW_CERTS=true

all commented out.
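
For anyone checking the same thing, here is a small sketch that lists any of those variables still set to true in the environment the controller starts with. The `enabled_bootstrap_flags` function is hypothetical; only the variable names come from the list above.

```shell
# Hypothetical pre-flight check: print the name of each bootstrap variable
# listed above that is still set to "true" in the current environment.
enabled_bootstrap_flags() {
  for name in ZITI_BOOTSTRAP_PKI ZITI_BOOTSTRAP_CONFIG ZITI_BOOTSTRAP_DATABASE ZITI_AUTO_RENEW_CERTS; do
    # Indirect lookup via eval keeps this portable to plain sh.
    eval "value=\${$name:-}"
    if [ "$value" = "true" ]; then
      echo "$name"
    fi
  done
}

enabled_bootstrap_flags
```

No output means none of the four are enabled.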

Then on ziti1 I ran the command to join:

ziti agent cluster add tls:ziti-controller2.wizznet.co.uk:1280 --id ziti-controller2

Also, for the record, here is how I had to set up the cert file references in config.yml. This part is very important:

identity:
  cert:        "pki/ziti-controller1/certs/server.chain.pem"
  #server_cert: "pki/ziti-controller1/certs/server.cert"
  key:         "pki/ziti-controller1/keys/server.key"
  ca:          "pki/ziti-controller1/certs/ziti-controller1.chain.pem"