What is the easiest way to create an HA setup with 3 controllers?

Maybe I can help. I haven't had a problem here. The service's default behavior is to automatically manage run-as user and filemode and owner for all files in the working directory.

You may provide a crafted configuration, database, PKI instead of generating/bootstrapping any combination of the above.

What's the specific issue with using your custom config.yml? From root's perspective, it's the same filesystem. For example, root may create or edit a config.yml in the working dir: sudo vim /var/lib/ziti-controller/config.yml. Then, when the service is restarted, systemd will set filemode/owner on all files in the working directory.
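
A rough sketch of that sequence, in case it helps. The source file ./my-config.yml is a placeholder, and the run helper with its DRY_RUN switch is only there so you can preview the commands before they touch anything; the working directory and unit name are the ones used in this thread.

```shell
# Sketch: copy a hand-written config.yml into the controller's working
# directory as root, then restart so systemd re-applies filemode/owner.
WORKDIR="${WORKDIR:-/var/lib/ziti-controller}"
SERVICE="${SERVICE:-ziti-controller.service}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

place_config() {
  local src="$1"
  run sudo cp "$src" "$WORKDIR/config.yml"
  run sudo systemctl restart "$SERVICE"
  # List the directory so you can see which owner and mode ended up applied.
  run sudo ls -l "$WORKDIR/"
}

# Usage: DRY_RUN=1 place_config ./my-config.yml
```

If the listing still shows root-owned files after the restart, that is the behavior worth reporting.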

I understand you need a Ziti router on a Linux distribution from the RedHat family. The openziti-router RPM includes a script for generating the router's config.yml which you may then customize, so you may not need the auto_enroll Python CLI that was intended as a more comprehensive configurator. Here's the Linux router deployment guide: Router Deployment | OpenZiti

Thanks for all the info. Didn't realize there was a script for RPM routers.

Yeah that's not what happened in my case. I get all sorts of permission errors when there's a file that root placed in the working dir. Like "ERROR: database file '/var/lib/private/ziti-controller/ziti.db' is not writable". And that's even using that systemd tmpfiles solution. So yeah I'm going down the route of editing the systemd service file.

Also, the ziti agent command wouldn't actually interact with the things I had running because the controller was on its "own filesystem" (I understand there's a way to enter that namespace and run commands using the .sock file that's there but I didn't really like that approach).

Either way, I think I'm getting close. I'm open to share all the steps I took (maybe make my terraform module configurable and sharing it with you) to get this running, if you want. I'll let you know when I have a working HA setup.

Thanks again @qrkourier !

I'm so glad you are finding a way forward, and acknowledge the awkwardness of nsenter for interacting with the agent on Linux. It's a set of tradeoffs, for sure. On the one hand, it's nice to not manage user and filemodes/owners (assuming it works correctly), but it's an extra step to call the agent. We could add a behavior in the agent CLI to detect running on Linux to eliminate that step.
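
For anyone following along, the extra step looks roughly like this. It's only a sketch: it assumes the unit is named ziti-controller.service, as elsewhere in this thread, and that entering the service's mount namespace is enough to reach the agent socket. Which namespaces are actually needed is my assumption, so adjust the nsenter flags to match your unit's sandboxing. The helper names are made up for illustration.

```shell
SERVICE="${SERVICE:-ziti-controller.service}"

# Look up the main PID of the running service.
service_pid() {
  systemctl show --property MainPID --value "$SERVICE"
}

# Build the nsenter command line for a PID plus the agent subcommand args.
# It prints the command instead of running it so you can review it first.
agent_cmd() {
  local pid="$1"; shift
  echo "sudo nsenter --target $pid --mount -- ziti agent $*"
}

# Usage:
#   agent_cmd "$(service_pid)" list
```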

I understand you're customizing the controller service unit to begin manually managing user, filemode, and file owner in the working directory. That should work fine. Still, I'm curious why systemd didn't succeed at setting the filemode/owner on those files. If you're able to assist with troubleshooting, the things I'd check are:

  1. Is /var/lib a mountpoint?
  2. Does chown -R nobody:nobody /var/lib/ziti-controller and systemctl restart ziti-controller.service work around that problem?
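
The two checks above can be sketched as follows. mountpoint comes from util-linux; the function names are just for illustration, and the workaround is printed rather than executed so you can review it before running it as root.

```shell
# Check 1: report whether a directory is itself a mountpoint.
check_mountpoint() {
  if mountpoint -q "$1"; then
    echo "$1 is a mountpoint"
  else
    echo "$1 is not a mountpoint"
  fi
}

# Check 2: the ownership workaround, printed for review.
print_workaround() {
  echo "sudo chown -R nobody:nobody /var/lib/ziti-controller"
  echo "sudo systemctl restart ziti-controller.service"
}

# Usage:
#   check_mountpoint /var/lib
#   print_workaround
```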

Having the agent detect the controller's namespace would be a great addition, for sure.

  1. No, I don't think so.
  2. Haven't tried that. Probably won't be able to rn - maybe once my setup is configured I can play with it a little bit and let you know

Guys, I keep getting these errors:

Apr 16 18:59:14 ip-12-0-3-156 ziti[4914]: {"_context":"tls:0.0.0.0:1280","file":"github.com/openziti/channel/v3@v3.0.39/classic_listener.go:219","func":"github.com/openziti/channel/v3.(*classicListener).acceptConnection.func1","level":"error","msg":"connection handler error for [tls:11.0.1.197:38400] (unknown/unenrolled router, routerId: rgiKRdK.i)","time":"2025-04-16T18:59:14.341Z"}
Apr 16 18:59:14 ip-12-0-3-156 ziti[4914]: {"_context":"tls:0.0.0.0:1280","file":"github.com/openziti/channel/v3@v3.0.39/classic_listener.go:199","func":"github.com/openziti/channel/v3.(*classicListener).acceptConnection.func1.1","level":"error","msg":"could not clear connection deadline for [tls:11.0.1.197:38400] (set tcp 12.0.3.156:1280: use of closed network connection)","time":"2025-04-16T18:59:14.341Z"}
Apr 16 18:59:16 ip-12-0-3-156 ziti[4914]: {"file":"github.com/openziti/ziti/controller/handler_ctrl/connect.go:116","func":"github.com/openziti/ziti/controller/handler_ctrl.(*ConnectHandler).HandleConnection","level":"error","msg":"unknown/unenrolled router","routerId":"rgiKRdK.i","time":"2025-04-16T18:59:16.899Z"}
Apr 16 18:59:16 ip-12-0-3-156 ziti[4914]: {"_context":"tls:0.0.0.0:1280","file":"github.com/openziti/channel/v3@v3.0.39/classic_listener.go:219","func":"github.com/openziti/channel/v3.(*classicListener).acceptConnection.func1","level":"error","msg":"connection handler error for [tls:11.0.1.197:38402] (unknown/unenrolled router, routerId: rgiKRdK.i)","time":"2025-04-16T18:59:16.899Z"}
Apr 16 18:59:16 ip-12-0-3-156 ziti[4914]: {"_context":"tls:0.0.0.0:1280","file":"github.com/openziti/channel/v3@v3.0.39/classic_listener.go:199","func":"github.com/openziti/channel/v3.(*classicListener).acceptConnection.func1.1","level":"error","msg":"could not clear connection deadline for [tls:11.0.1.197:38402] (set tcp 12.0.3.156:1280: use of closed network connection)","time":"2025-04-16T18:59:16.899Z"}
Apr 16 18:59:25 ip-12-0-3-156 ziti[4914]: {"file":"github.com/openziti/ziti/controller/handler_ctrl/connect.go:116","func":"github.com/openziti/ziti/controller/handler_ctrl.(*ConnectHandler).HandleConnection","level":"error","msg":"unknown/unenrolled router","routerId":"rgiKRdK.i","time":"2025-04-16T18:59:25.932Z"}
Apr 16 18:59:25 ip-12-0-3-156 ziti[4914]: {"_context":"tls:0.0.0.0:1280","file":"github.com/openziti/channel/v3@v3.0.39/classic_listener.go:219","func":"github.com/openziti/channel/v3.(*classicListener).acceptConnection.func1","level":"error","msg":"connection handler error for [tls:11.0.1.197:48882] (unknown/unenrolled router, routerId: rgiKRdK.i)","time":"2025-04-16T18:59:25.932Z"}
Apr 16 18:59:25 ip-12-0-3-156 ziti[4914]: {"_context":"tls:0.0.0.0:1280","file":"github.com/openziti/channel/v3@v3.0.39/classic_listener.go:199","func":"github.com/openziti/channel/v3.(*classicListener).acceptConnection.func1.1","level":"error","msg":"could not clear connection deadline for [tls:11.0.1.197:48882] (set tcp 12.0.3.156:1280: use of closed network connection)","time":"2025-04-16T18:59:25.932Z"}
Apr 16 18:59:35 ip-12-0-3-156 ziti[4914]: {"file":"github.com/openziti/ziti/controller/handler_ctrl/connect.go:116","func":"github.com/openziti/ziti/controller/handler_ctrl.(*ConnectHandler).HandleConnection","level":"error","msg":"unknown/unenrolled router","routerId":"rgiKRdK.i","time":"2025-04-16T18:59:35.874Z"}
Apr 16 18:59:35 ip-12-0-3-156 ziti[4914]: {"_context":"tls:0.0.0.0:1280","file":"github.com/openziti/channel/v3@v3.0.39/classic_listener.go:219","func":"github.com/openziti/channel/v3.(*classicListener).acceptConnection.func1","level":"error","msg":"connection handler error for [tls:11.0.1.197:38384] (unknown/unenrolled router, routerId: rgiKRdK.i)","time":"2025-04-16T18:59:35.874Z"}
Apr 16 18:59:35 ip-12-0-3-156 ziti[4914]: {"_context":"tls:0.0.0.0:1280","file":"github.com/openziti/channel/v3@v3.0.39/classic_listener.go:199","func":"github.com/openziti/channel/v3.(*classicListener).acceptConnection.func1.1","level":"error","msg":"could not clear connection deadline for [tls:11.0.1.197:38384] (set tcp 12.0.3.156:1280: use of closed network connection)","time":"2025-04-16T18:59:35.875Z"}
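
To see which routers the controller is rejecting, the log lines above can be reduced to a count per router ID. A minimal sketch, assuming the journal format shown here, where the connect handler entries carry a "routerId" JSON field; the function name is made up and the unit name in the usage line is the one used earlier in this thread.

```shell
# Count "unknown/unenrolled router" errors per routerId.
# Reads journal lines on stdin and prints "<routerId> <count>" per router.
count_unenrolled() {
  grep -F '"msg":"unknown/unenrolled router"' \
    | grep -o '"routerId":"[^"]*"' \
    | cut -d'"' -f4 \
    | sort | uniq -c | awk '{print $2, $1}'
}

# Usage:
#   journalctl -u ziti-controller.service --since today | count_unenrolled
```

Only the connect handler entries are counted, so the paired listener errors for the same connection attempt don't inflate the numbers.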

Right now I have two controllers and two routers. I had to run ziti agent cluster init admin on both controllers because the second controller reported that the cluster wasn't initialized.
Do I need to enroll all routers to all controllers manually? And do I really need to init the admin on both controllers?

The controllers can see each other:

If it helps, this is the config.yml for one of them:

cluster:
  dataDir: ./data/ctrl

db: ./ziti.db

identity:
  cert: ./certs/server.chain.pem
  key: ./keys/server.key
  ca: ./certs/stagingctrl.chain.pem

ctrl:
  listener: tls:0.0.0.0:1280
  options:
    advertiseAddress: tls:ec2-52-73-128-149.compute-1.amazonaws.com:1280

edge:
  api:
    address: "ec2-52-73-128-149.compute-1.amazonaws.com:1280"
  enrollment:
    signingCert:
      cert: ./certs/stagingctrl.cert
      key: ./keys/stagingctrl.key
    edgeIdentity:
      duration: 5m
    edgeRouter:
      duration: 5m

web:
  - name: all-apis-localhost
    bindPoints:
      - interface: 0.0.0.0:1280
        address: "ec2-52-73-128-149.compute-1.amazonaws.com:1280"
    options:
      minTLSVersion: TLS1.2
      maxTLSVersion: TLS1.3
    apis:
      - binding: health-checks
      - binding: fabric
      - binding: edge-management
      - binding: edge-client
      - binding: edge-oidc
      - binding: zac
        options:
          location: /opt/openziti/share/console
          indexFile: index.html

You shouldn't init both controllers. General flow is:

  1. Init first controller in cluster, to get a 1 node cluster
  2. Add additional controllers to cluster
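
As commands, that flow might look like the sketch below. The -i value and peer addresses are placeholders, the init arguments are copied from the example in this post, and the run helper with DRY_RUN is only an illustration so the commands can be previewed first.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Init exactly one controller, then add every other peer from that one.
bootstrap_cluster() {
  local first="$1"; shift
  # Step 1: init only the first controller, giving a 1 node cluster.
  run ziti agent cluster init -i "$first" admin admin admin
  # Step 2: add each remaining controller by its advertised address.
  local peer
  for peer in "$@"; do
    run ziti agent cluster add -i "$first" "$peer"
  done
}

# Usage: DRY_RUN=1 bootstrap_cluster ctrl1 tls:ctrl2.example.internal:1280
```

The point of the shape is that init appears once; the other controllers are never initialized on their own.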

You shouldn't have been able to join the nodes after they had both been initialized. Can you tell me how you achieved that? When you init a controller, it generates a cluster id. Nodes with different cluster ids shouldn't be able to join.

Example:

$ ziti agent cluster init -i ctrl1 admin admin admin
success
$ ziti agent cluster init -i ctrl2 admin admin admin
success
$ ziti agent cluster add -i ctrl1 tls:localhost:6363
cluster add failed: id not supplied and unable to retrieve [unable to dial tls:localhost:6363: local cluster id 712e79b3-f1db-4286-9fac-6b59b8f3c1fd doesn't match peer cluster id bad25bfc-b691-4fef-b06b-1fabf076cb4e]

I'm guessing this is another case where providing the id allowed the add to bypass a sanity check.
I'm going to add an issue to always validate that controllers are reachable and valid before updating the cluster list: "Always check that a controller is reachable and valid before adding it to an HA controller cluster", Issue #3005 in openziti/ziti on GitHub.

In any case, joining two separate clusters together means your data model is in an indeterminate state.

Paul

I'm not using -i at all. Every time I try to use it, I get this:
Error: no processes found matching filter, use 'ziti agent list' to list candidates

I have three separate instances, not trying to spin up an HA cluster with just one now.

--tcp-addr just halts with "EOF".

I have this now:

And after removing the DB files and stopping the controllers at stagingctrl and prodctrl, then adding them back from devctrl (and enrolling both routers to devctrl), I get this:

I've followed the steps to generate the PKI and downloaded each controller's PKI files before setting up the controller.

# Create the trust root, a self-signed CA
ziti pki create ca --trust-domain <redacted>.internal --pki-root ./pki --ca-file ca --ca-name '<Redacted> Internal Trust Root'

# Create the dev controller intermediate/signing cert
ziti pki create intermediate --pki-root ./pki --ca-name ca --intermediate-file devctrl --intermediate-name 'Dev Controller Signing Cert'
# Create the dev controller server cert
ziti pki create server --pki-root ./pki --ca-name devctrl --dns ${devctrl_dns} --ip ${devctrl_ip} --server-name devctrl --spiffe-id 'controller/devctrl'
# Create the dev controller client cert
ziti pki create client --pki-root ./pki --ca-name devctrl --client-name devctrl --spiffe-id 'controller/devctrl'

# Create the staging controller intermediate/signing cert
ziti pki create intermediate --pki-root ./pki --ca-name ca --intermediate-file stagingctrl --intermediate-name 'Staging Controller Signing Cert'
# Create the staging controller server cert
ziti pki create server --pki-root ./pki --ca-name stagingctrl --dns ${stagingctrl_dns} --ip ${stagingctrl_ip} --server-name stagingctrl --spiffe-id 'controller/stagingctrl'
# Create the staging controller client cert
ziti pki create client --pki-root ./pki --ca-name stagingctrl --client-name stagingctrl --spiffe-id 'controller/stagingctrl'

# Create the prod controller intermediate/signing cert
ziti pki create intermediate --pki-root ./pki --ca-name ca --intermediate-file prodctrl --intermediate-name 'Prod Controller Signing Cert'
# Create the prod controller server cert
ziti pki create server --pki-root ./pki --ca-name prodctrl --dns ${prodctrl_dns} --ip ${prodctrl_ip} --server-name prodctrl --spiffe-id 'controller/prodctrl'
# Create the prod controller client cert
ziti pki create client --pki-root ./pki --ca-name prodctrl --client-name prodctrl --spiffe-id 'controller/prodctrl'
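
One way to sanity-check the result is to read the SPIFFE ID back out of each generated certificate. This is a sketch, assuming OpenSSL 1.1.1 or newer (for the -ext option) and that the SPIFFE ID is stored as a URI entry in the subjectAltName. The function name is made up, and the path in the usage line is a guess at where the PKI tool wrote the file.

```shell
# Print the SPIFFE URI(s) found in a certificate's subjectAltName.
# For a chain file, openssl x509 reads the first certificate in it.
spiffe_ids() {
  openssl x509 -in "$1" -noout -ext subjectAltName \
    | grep -o 'URI:spiffe://[^, ]*' \
    | sed 's/^URI://'
}

# Usage (path is an assumption, adjust to your PKI layout):
#   spiffe_ids ./pki/devctrl/certs/server.chain.pem
```

Each controller's server and client cert should report its own controller/<name> ID under the same trust domain.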

Ended up going with a single controller solution, using a separate block storage disk for the /var/lib/ziti-controller folder so if I have to change the machine's base image I don't lose the whole setup.
I found it very hard to get the controllers to communicate on an HA setup.

Sorry you had a hard experience overall. We are still working through docs, guides and experience so hopefully this is just a timing issue. It's definitely more complex to establish a cluster, no doubt.

We do appreciate that you took the time to tell us that you struggled. It helps us know where to focus in doc/ux/devx/etc. Hopefully we can keep making it easier for the future.

If you have any other feedback on where things fell apart for you, you can DM me directly here on Discourse if you like, or just put it into this thread.

Yeah, definitely!
I didn’t share more because I set a deadline for this project, so I had to be quick implementing it. But I’ll play with OpenZiti a little more in my free time. Hopefully I can get this HA setup running and share a list of steps, along with what I find hard to do and what is easy to go through.
I just finalized setting up my services and policies and will onboard people on Monday, so hopefully by next week I can get back to this HA setup.
Thanks for all the help, guys.
