HA Controller not adding to cluster

Hello

I have 3 controllers (version 1.5.4), all of which work independently of each other. When I try to run ziti agent cluster add, I get the following error:

cluster add failed: id not supplied and unable to retrieve [ tls failed to verify cert : x509: cert signed by unknown authority

This has to do with PKI. I created all the certs on one of my Ziti VMs, tarred the pki folder, and copied it to the others. I then modified the config files to point to the correct certs. To create the certs, I followed the instructions on the Ziti website.

I've done this process before with minimal headaches, but today it doesn't want to work.

Note: When I try to use the admin web page, I get tls: first record does not look like tls, so there is definitely something wrong with how I did PKI.

So I figured out what is going on, but I'm not quite sure how to fix it. When I start the controller with ziti controller run and the config file, it works fine.

When I start it through the systemd service file, it does not work. The reason is that the controller started by systemd is trying to use intermediate certs, which I find strange since that is nowhere in my config.

You're certain that the systemd unit is running the exact same command, right? My guess is that your config file is using relative paths and the CWD (current working dir) is different when run via systemd.

I am certain. I will get the config for you.

I did notice I am getting a lot of errors:

err=[not handler for requested protocols] handshake failed

The error occurs on the connection between two Ziti VMs.

Sorry about the delay. Below are the systemd file and the config file, as well as the exact errors I am seeing. Just to note, I can get to the admin web consoles of both when running ziti controller run config.yml manually. When doing this manually, though, I cannot get the two controllers to communicate. SELinux and firewalld are both disabled.

Systemd:

[Unit]
Description=OpenZiti Controller
After=network-online.target

[Service]
Type=simple

# manage the user and permissions for the service automatically
DynamicUser=yes

# this env file configures the service, including whether or not to perform bootstrapping
EnvironmentFile=/opt/openziti/etc/controller/service.env

# relative to /var/lib
StateDirectory=ziti-controller
WorkingDirectory=/var/lib/ziti-controller
ReadOnlyPaths=/opt/openziti/share/console

ExecStartPre=/opt/openziti/etc/controller/entrypoint.bash check config.yml
ExecStart=/opt/openziti/bin/ziti controller run config.yml ${ZITI_ARGS}

Restart=always
RestartSec=3

LimitNOFILE=65535
UMask=0007

[Install]
WantedBy=multi-user.target

Controller1:

v: 3

cluster:
  dataDir:         "/var/lib/private/ziti-controller/raft"

identity:
  cert: ./pki/ctrl1/certs/server.chain.pem
  key: ./pki/ctrl1/keys/server.key
  ca: ./pki/ctrl1/certs/ctrl1.chain.pem

ctrl:
  listener: tls:0.0.0.0:6262
  options:
    advertiseAddress: tls:Ziti01.5G.MIL:6262

events:
  jsonLogger:
    subscriptions:
      - type: connect
      - type: cluster
    handler:
      type: file
      format: json
      path: /tmp/ziti-events.log

edge:
  api:
    address: Ziti01.5G.MIL:1280
  enrollment:
    signingCert:
      cert: pki/ctrl1/certs/ctrl1.cert
      key: pki/ctrl1/keys/ctrl1.key
    edgeIdentity:
      duration: 5m
    edgeRouter:
      duration: 5m

web:
  - name: all-apis-localhost
    bindPoints:
      - interface: 0.0.0.0:1280
        address: Ziti01.5G.MIL:1280
    options:
      minTLSVersion: TLS1.2
      maxTLSVersion: TLS1.3
    apis:
      - binding: fabric
      - binding: edge-management
      - binding: edge-client
      - binding: edge-oidc
      - binding: zac
        options:
          location: /opt/openziti/share/console
          indexFile: index.html

Error when trying to connect the two controllers (not via systemd), controller one:

ERROR transport/v2/tls.(*sharedListener).processConn [tls:0.0.0.0:1280]: {remote=[162.178.0.22:60438] error=[not handler for requested protocols [ziti-ctrl]]} handshake failed

Error when trying to connect the two controllers (not via systemd), controller two:

ERROR ziti/controller/raft/mesh.(*impl).Dial: {address=[tls:Ziti01.5G.MIL:1280] error=[error dialing peer tls:Ziti01.5G.MIL:1280: remote error: tls: internal error]} unable to get or connect raft peer channel
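
For what it's worth, both errors above mention port 1280, while the config posted above has ctrl.listener on tls:0.0.0.0:6262. Whether that mismatch is the actual cause here is only a guess, but it is cheap to check. A minimal sketch for comparing the address a peer dialed with the ctrl listener follows; addr_port and same_port are hypothetical helper names, not part of the ziti CLI.

```shell
# Hypothetical helpers, not part of the ziti CLI.

# Print the port of an address such as tls:Ziti01.5G.MIL:1280
# by stripping everything up to and including the last colon.
addr_port() { printf '%s\n' "${1##*:}"; }

# Succeed only if the two addresses use the same port.
same_port() { [ "$(addr_port "$1")" = "$(addr_port "$2")" ]; }
```

For example, same_port tls:Ziti01.5G.MIL:1280 tls:0.0.0.0:6262 || echo "dialed port differs from ctrl listener port" would print the message for the addresses in the log and config above.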

Systemd Error Log:

Starting OpenZiti Controller...
Apr 14 08:03:39 Ziti01 entrypoint.bash[1437913]: realpath: missing operand
Apr 14 08:03:39 Ziti01 entrypoint.bash[1437913]: Try 'realpath --help' for more information.
Apr 14 08:03:39 Ziti01 entrypoint.bash[1437917]: realpath: missing operand
Apr 14 08:03:39 Ziti01 entrypoint.bash[1437917]: Try 'realpath --help' for more information.
Apr 14 08:03:39 Ziti01 entrypoint.bash[1437903]: ERROR: database file '' is not writable
Apr 14 08:03:39 Ziti01 entrypoint.bash[1437903]: Provide a configuration in '/var/lib/private/ziti-controller' or generate with:
Apr 14 08:03:39 Ziti01 entrypoint.bash[1437903]: * Set vars in'/opt/openziti/etc/controller/bootstrap.env'
Apr 14 08:03:39 Ziti01 entrypoint.bash[1437903]: * Run '/opt/openziti/etc/controller/bootstrap.bash'
Apr 14 08:03:39 Ziti01 entrypoint.bash[1437903]: * Run 'systemctl enable --now ziti-controller.service'
Apr 14 08:03:39 Ziti01 entrypoint.bash[1437903]: WARN: set VERBOSE=1 or DEBUG=1 for more output

@TheLumberjack Any ideas why I may be seeing this?

Running openssl s_client -connect, I am getting "key values mismatch".
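
A "key values mismatch" from OpenSSL generally means the certificate and the private key passed together do not belong to each other. A minimal sketch for checking a pair, assuming openssl is installed and that the leaf certificate is the first one in the chain file; cert_matches_key is a hypothetical helper name:

```shell
# Hypothetical helper: succeed only if the first certificate in $1 and the
# private key in $2 carry the same public key.
cert_matches_key() {
  local cert="$1" key="$2" cert_pub key_pub
  # Public key embedded in the certificate (PEM).
  cert_pub=$(openssl x509 -in "$cert" -noout -pubkey 2>/dev/null)
  # Public key derived from the private key (PEM).
  key_pub=$(openssl pkey -in "$key" -pubout 2>/dev/null)
  # Empty output means openssl could not read a file; treat that as a failure.
  [ -n "$cert_pub" ] && [ -n "$key_pub" ] && [ "$cert_pub" = "$key_pub" ]
}
```

Usage would look like cert_matches_key pki/ctrl2/certs/server.chain.pem pki/ctrl2/keys/server.key && echo match || echo MISMATCH, repeated for each controller's cert and key on the VM where they are actually used.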

What would you recommend when creating certs for 3 separate VMs?

This is what I ran (values have been modified for sharing). Once the certs were created on the first VM, I backed up the whole pki directory and moved it.

# Create the trust root, a self-signed CA
ziti pki create ca --trust-domain 5G.MIL --pki-root ./pki --ca-file ca --ca-name 'HA Trust Root'

# Create the controller 1 intermediate/signing cert
ziti pki create intermediate --pki-root ./pki --ca-name ca --intermediate-file ctrl1 --intermediate-name 'Controller One Signing Cert'

# Create the controller 1 server cert
ziti pki create server --pki-root ./pki --ca-name ctrl1 --dns Ziti01.5G.MIL --ip 192.168.0.1 --server-name ctrl1 --spiffe-id 'controller/ctrl1'

# Create the controller 1 client cert
ziti pki create client --pki-root ./pki --ca-name ctrl1 --client-name ctrl1 --spiffe-id 'controller/ctrl1'

# Create the controller 2 intermediate/signing cert
ziti pki create intermediate --pki-root ./pki --ca-name ca --intermediate-file ctrl2 --intermediate-name 'Controller Two Signing Cert'

# Create the controller 2 server cert
ziti pki create server --pki-root ./pki --ca-name ctrl2 --dns Ziti02.5G.MIL --ip 192.168.0.2 --server-name ctrl2 --spiffe-id 'controller/ctrl2'

# Create the controller 2 client cert
ziti pki create client --pki-root ./pki --ca-name ctrl2 --client-name ctrl2 --spiffe-id 'controller/ctrl2'

# Create the controller 3 intermediate/signing cert
ziti pki create intermediate --pki-root ./pki --ca-name ca --intermediate-file ctrl3 --intermediate-name 'Controller Three Signing Cert'

# Create the controller 3 server cert
ziti pki create server --pki-root ./pki --ca-name ctrl3 --dns Ziti03.5G.MIL --ip 192.168.0.3 --server-name ctrl3 --spiffe-id 'controller/ctrl3'

# Create the controller 3 client cert
ziti pki create client --pki-root ./pki --ca-name ctrl3 --client-name ctrl3 --spiffe-id 'controller/ctrl3'

ZAC doesn't use mTLS, so it makes sense why ZAC would appear to work when the controllers don't...

Doing what you did does seem to me to be the correct type of flow you'd need to follow. How are you running the controller? Are you directly running it or running it from a systemd unit?

I would recommend you just run the ziti CLI directly and see what happens. If it still fails, I would double-check that the config file is correct and references the proper paths/PKI locations. I see the locations are relative; maybe make them absolute and very clear?
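
To make the relative-path concern concrete, here is a minimal sketch that resolves every relative cert/key/ca path in a config against a given working directory and reports whether the file exists there. check_relative_paths is a hypothetical helper, not part of ziti, and the simple line matching assumes a config laid out like the one posted above.

```shell
# Hypothetical helper (not part of ziti): resolve each relative cert/key/ca
# path in a controller config against a working directory and report whether
# the file exists there.
check_relative_paths() {
  local config="$1" workdir="$2" path
  grep -E '^[[:space:]]*(cert|key|ca):' "$config" \
    | sed -E 's/^[[:space:]]*(cert|key|ca):[[:space:]]*//' \
    | tr -d '"' \
    | while read -r path; do
        case "$path" in
          ''|/*) continue ;;   # skip empty values and absolute paths
        esac
        if [ -e "$workdir/$path" ]; then
          echo "ok       $workdir/$path"
        else
          echo "MISSING  $workdir/$path"
        fi
      done
}
```

With the unit file above, the directory to test would be the WorkingDirectory, for example check_relative_paths /var/lib/ziti-controller/config.yml /var/lib/ziti-controller. Because the unit uses DynamicUser=yes, the state directory lives under /var/lib/private, so this likely needs to be run as root.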

Can you provide the exact command you're running? Are you using the PKI from one to connect to another?

I have been running via

ziti controller run config.yml

The exact command I ran was

openssl s_client -connect Ziti01.5G.MIL:6262 --cert "pki/ctrl2/certs/server.chain.pem" --key "pki/ctrl2/keys/server.key" --CAfile "pki/ctrl2/certs/ctrl2.chain.pem"

I realized I didn't have Ziti01 started. After starting it, I get Verify return code 0 on Ziti02, and on Ziti01 I get the error tls bad record mac handshake failed.

On reboot, I get i/o timeouts when running the openssl connect command:

error receiving hello from address i/o timeout 
could not clear connection deadline

So you get "Verify return code 0" from ziti02 --> ziti01, but from ziti01 to ziti02 it fails using openssl s_client. Do I have that correct?

And you're certain the running ziti01 controller is using the paths you use when using openssl? Please forgive me for asking this question repeatedly, but it's literally the only thing that I can think of that might be wrong or incorrect.

I don't know what the last message means wrt the reboot. When this problem has happened to me in the past, I've always just had the wrong certs somehow... I'll have to think about this more. I don't know what/where/how it's gone wrong.

Hey @TheLumberjack, I pulled in our system admin and he noticed the VM CPU usage had jumped to 200 percent and software was crashing. He did some work on the host, and now everything seems to be working: the controllers have synced up. I honestly have no clue why that happened.

I do still have the issue with starting from systemd. I will verify the working directory today and try to get that working.