Router connection to Controller, handshake failed

Hiya, I've recently re-built my HA Ziti system using v1.6.0 for Controllers and Routers. I had been on v1.5.0 for a while.

My system consists of two HA Controllers and one public Edge Router.

I'm having an issue where, after restarting an Edge Router, it can no longer connect to every HA Controller.

Here are the steps I take.

Once my HA Controller cluster is established, I run the following to create an ER.

ziti edge create edge-router "edge-router-${RANDOM_STRING}" --jwt-output-file "${ZITI_HOME}/edge-router-${RANDOM_STRING}.jwt" --tunneler-enabled

Next I create the router configuration file.

v: 3

identity:
  cert:             "/var/lib/private/ziti-router/router.cert"
  server_cert:      "/var/lib/private/ziti-router/router.server.chain.cert"
  key:              "/var/lib/private/ziti-router/router.key"
  ca:               "/var/lib/private/ziti-router/router.cas"

ha:
  enabled: true

ctrl:
  endpoint:             tls:ziti-controller-1.az.lifeboat.ziti:8443

link:
  dialers:
    - binding: transport
  listeners:
    - binding:          transport
      bind:             tls:0.0.0.0:9443
      advertise:        tls:ziti-router-2.az.lifeboat.ziti:9443
      options:
        outQueueSize:   4

listeners:
  - binding: edge
    address: tls:0.0.0.0:443
    options:
      advertise: ziti-router-2.az.lifeboat.ziti:443
      connectTimeoutMs: 5000
      getSessionTimeout: 60
  - binding: tunnel
    options:
      mode: host #tproxy|host

edge:
  csr:
    country: US
    province: NC
    locality: Charlotte
    organization: NetFoundry
    organizationalUnit: Ziti
    sans:
      dns:
        - localhost
        - ziti-router-2.az.lifeboat.ziti
        - ziti-router-2
      ip:
        - "127.0.0.1"
        - "::1"

forwarder:
  latencyProbeInterval: 0
  xgressDialQueueLength: 1000
  xgressDialWorkerCount: 128
  linkDialQueueLength: 1000
  linkDialWorkerCount: 32
  rateLimitedQueueLength: 100
  rateLimitedWorkerCount: 25

Next I enrol the Edge Router.

ziti router enroll ${ZITI_HOME}/config.yml --jwt "${ZITI_HOME}/edge-router-${RANDOM_STRING}.jwt"

Finally I start the router service.

systemctl start ziti-router.service

I see the following two log lines, which suggest the ER is able to connect to both my HA Controllers.

Apr 11 11:45:48 ziti-router-2 ziti[1347]: {"endpoint":"tls:ziti-controller-1.az.lifeboat.ziti:8443","file":"github.com/openziti/ziti/router/env/ctrls.go:203","func":"github.com/openziti/ziti/router/env.(*networkControllers).connectToControllerWithBackoff.func3","level":"info","msg":"successfully connected to controller","time":"2025-04-11T11:45:48.722Z"}
Apr 11 11:45:58 ziti-router-2 ziti[1347]: {"endpoint":"tls:ziti-controller-2.az.lifeboat.ziti:8443","file":"github.com/openziti/ziti/router/env/ctrls.go:203","func":"github.com/openziti/ziti/router/env.(*networkControllers).connectToControllerWithBackoff.func3","level":"info","msg":"successfully connected to controller","time":"2025-04-11T11:45:58.735Z"}

Also, when I run ziti edge list edge-routers on either of my Controllers, the ER's ONLINE state is true.

╭────────────┬───────────────────┬────────┬───────────────┬──────┬────────────╮
│ ID         │ NAME              │ ONLINE │ ALLOW TRANSIT │ COST │ ATTRIBUTES │
├────────────┼───────────────────┼────────┼───────────────┼──────┼────────────┤
│ 8F3ku2Kbfa │ edge-router-ljxvi │ true   │ true          │    0 │            │
╰────────────┴───────────────────┴────────┴───────────────┴──────┴────────────╯

My problem starts when I restart the ER with systemctl restart ziti-router.service. After this, it seems my ER is only able to connect to one of my HA Controllers, ziti-controller-2.

In the ER log I see a successful connection to ziti-controller-2.

Apr 11 12:01:50 ziti-router-2 ziti[1631]: {"endpoint":"tls:ziti-controller-2.az.lifeboat.ziti:8443","file":"github.com/openziti/ziti/router/env/ctrls.go:203","func":"github.com/openziti/ziti/router/env.(*networkControllers).connectToControllerWithBackoff.func3","level":"info","msg":"successfully connected to controller","time":"2025-04-11T12:01:50.612Z"}

But I also see the following, which suggests it's unable to connect to ziti-controller-1.

Apr 11 12:04:18 ziti-router-2 ziti[1631]: {"endpoint":"tls:ziti-controller-1.az.lifeboat.ziti:8443","error":"error connecting ctrl (EOF)","file":"github.com/openziti/ziti/router/env/ctrls.go:192","func":"github.com/openziti/ziti/router/env.(*networkControllers).connectToControllerWithBackoff.func2","level":"error","msg":"unable to connect controller","time":"2025-04-11T12:04:18.639Z"}
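
To compare the two controllers at a glance, here is a rough sketch that pulls only the controller connection events out of the router log. `ctrl_events` is a hypothetical helper name, not part of ziti; it assumes the JSON log format shown in the lines above, where the "endpoint" key comes before "msg".

```shell
# Hypothetical helper: read router log lines on stdin and print
# "<endpoint> <msg>" for each controller connect success or failure.
# Assumes the JSON field order seen in this thread ("endpoint" before "msg").
ctrl_events() {
    grep -E '"msg":"(successfully connected to controller|unable to connect controller)"' |
        sed -E 's/.*"endpoint":"([^"]*)".*"msg":"([^"]*)".*/\1 \2/'
}

# Usage:
#   journalctl -u ziti-router.service | ctrl_events
```

Each output line then shows which controller endpoint the event was for and whether the connection succeeded.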

In the ziti-controller-1 logs I see the following line repeating.

Apr 11 12:04:58 ziti-controller-1 ziti[3445]: {"_context":"tls:0.0.0.0:8443","file":"github.com/openziti/channel/v4@v4.0.4/classic_listener.go:213","func":"github.com/openziti/channel/v4.(*classicListener).acceptConnection.func1","level":"error","msg":"connection handler error for [tls:10.128.6.4:34640] (x509: certificate signed by unknown authority)","time":"2025-04-11T12:04:58.921Z"}

Now when I run ziti edge list edge-routers on each Controller, I see the ONLINE state is false on ziti-controller-1 and true on ziti-controller-2.

From reading the Changelog I can't tell what might have caused this issue for me, or whether I need to change my install process. Currently I don't experience this issue if I run v1.5.0, but I do experience it on v1.5.4 and v1.6.0.

Thanks in advance !

Hi @farmhouse

I have a few things to check if we can narrow this down:

Can you try the following:

  1. Bring the router up for the first time
  2. Don't restart, but wait a few minutes
  3. Mark the router as disabled. This should force the controllers to disconnect it
  4. Re-enable the router.

Does it reconnect to both? This will test if it's any reconnect that will have an issue or only restarts.
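
The disable/re-enable test can be scripted roughly like this. It is only a sketch: `bounce_router` is a hypothetical helper name, the `ziti fabric update router` and `ziti edge list edge-routers` commands are the ones used elsewhere in this thread, and it assumes you are already logged in with `ziti edge login`.

```shell
# Hypothetical helper: force the controllers to drop a router, then let it reconnect.
# $1 = router name, $2 = seconds to stay disabled (default 60)
bounce_router() {
    router=$1
    wait_s=${2:-60}
    # Disabling the router should make the controllers disconnect it.
    ziti fabric update router "$router" --disabled || return 1
    sleep "$wait_s"
    # Re-enable it so the router can dial the controllers again.
    ziti fabric update router "$router" --disabled=false || return 1
    # Show the ONLINE state as seen by the controller we are logged in to.
    ziti edge list edge-routers
}

# Usage:
#   bounce_router edge-router-test-1 60
```

The listing only reflects the controller you are logged in to, so repeat the final command against the other controller as well.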

Are the certs getting renewed? Do you see the cert files change after the initial startup or after the restart?
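
One way to check, as a minimal sketch: take a checksum snapshot of the router's cert files before and after the restart and compare them. `cert_snapshot` and `certs_changed` are hypothetical helper names; the paths in the usage comments come from the router config posted above, and `sha256sum` and `openssl` are assumed to be installed.

```shell
# Hypothetical helper: print "<sha256>  <path>" for each given cert file that exists.
cert_snapshot() {
    for f in "$@"; do
        [ -f "$f" ] && sha256sum "$f"
    done
    return 0
}

# Hypothetical helper: succeeds (exit 0) when two snapshots differ,
# i.e. at least one cert file was rewritten.
certs_changed() {
    [ "$1" != "$2" ]
}

# Usage (paths from the router config in this thread):
#   d=/var/lib/private/ziti-router
#   before=$(cert_snapshot "$d/router.cert" "$d/router.server.chain.cert")
#   systemctl restart ziti-router.service
#   after=$(cert_snapshot "$d/router.cert" "$d/router.server.chain.cert")
#   certs_changed "$before" "$after" && echo "certs were rewritten"
#   openssl x509 -noout -enddate -in "$d/router.cert"   # expiry of the current cert
```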

Finally, I would not use v1.6.0 for now, as it hasn't been released yet.

Paul

Thanks @plorenz.

I have just tried the suggested steps on v1.5.4. It seems I do encounter this issue on reconnects as well as restarts.

I've also simplified my install process for the purpose of reproducing my issue.

Here's my install script in full. This time it's based on the HA quick start, mostly grabbed from here. I just run it on a single Debian VM.

ZITI_CLI_DEB_VER=1.5.4
ZITI_CONTROLLER_DEB_VER=1.5.4
ZITI_ROUTER_DEB_VER=1.5.4

# Install OZ packages
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get -o Dpkg::Options::="--force-confdef" --allow-downgrades --allow-remove-essential --allow-change-held-packages -fuy dist-upgrade
apt-get install -y gnupg curl wget
curl -sSLf https://get.openziti.io/tun/package-repos.gpg | gpg --dearmor --output /usr/share/keyrings/openziti.gpg
chmod a+r /usr/share/keyrings/openziti.gpg
echo "deb [signed-by=/usr/share/keyrings/openziti.gpg] https://packages.openziti.org/zitipax-openziti-deb-test debian main" > /etc/apt/sources.list.d/openziti-release.list
apt-get update
apt-get install -y openziti=${ZITI_CLI_DEB_VER} openziti-controller=${ZITI_CONTROLLER_DEB_VER} openziti-router=${ZITI_ROUTER_DEB_VER} openziti-console --allow-downgrades

# Set up HA ctrl1
export TRUST_DOMAIN="example.trust.domain"
export ZITI_PWD="password"
export ZITI_INST="ctrl1"
export ZITI_CTRL_PORT="6400"
export ZITI_ROUTER_PORT="6401"
export ZITI_INITIAL_CTRL="tls:ctrl1.${TRUST_DOMAIN}:${ZITI_CTRL_PORT}"
sudo chown ziti:ziti /sharedfs/
echo "127.0.0.1 ctrl1.${TRUST_DOMAIN}" >> /etc/hosts
echo "127.0.0.1 ctrl2.${TRUST_DOMAIN}" >> /etc/hosts
echo "127.0.0.1 rout1.${TRUST_DOMAIN}" >> /etc/hosts
ziti edge quickstart ha \
    --instance-id="ctrl1" \
    --ctrl-port="${ZITI_CTRL_PORT}" \
    --router-port="${ZITI_ROUTER_PORT}" \
    --home="/sharedfs/ziti" \
    --ctrl-address="${ZITI_INST}.${TRUST_DOMAIN}" \
    --router-address="${ZITI_INST}.${TRUST_DOMAIN}" \
    --trust-domain="${TRUST_DOMAIN}" \
    --password $ZITI_PWD \
    &> ctrl1.log &

# Wait for ctrl1 to finish setup
sleep 30

# Set up HA ctrl2
export TRUST_DOMAIN="example.trust.domain"
export ZITI_PWD="password"
export ZITI_INST="ctrl2"
export ZITI_INITIAL_CTRL="tls:ctrl1.${TRUST_DOMAIN}:6400"
export ZITI_CTRL_PORT="6500"
export ZITI_ROUTER_PORT="6501"

mkdir -p "/tmp/${ZITI_INST}/pki/root-ca/keys"
mkdir -p "/tmp/${ZITI_INST}/pki/root-ca/certs"
cp /sharedfs/ziti/pki/root-ca/keys/root-ca.key /tmp/${ZITI_INST}/pki/root-ca/keys/
cp /sharedfs/ziti/pki/root-ca/certs/root-ca.cert /tmp/${ZITI_INST}/pki/root-ca/certs/
cp /sharedfs/ziti/pki/root-ca/index.txt /tmp/${ZITI_INST}/pki/root-ca/index.txt

ziti edge quickstart join \
    --instance-id "${ZITI_INST}" \
    --ctrl-port "${ZITI_CTRL_PORT}" \
    --router-port "${ZITI_ROUTER_PORT}" \
    --home "/tmp/${ZITI_INST}" \
    --ctrl-address="${ZITI_INST}.${TRUST_DOMAIN}" \
    --router-address="${ZITI_INST}.${TRUST_DOMAIN}" \
    --trust-domain="${TRUST_DOMAIN}" \
    --cluster-member "${ZITI_INITIAL_CTRL}" \
    --password $ZITI_PWD \
    &> ctrl2.log &

# Wait for ctrl2 to finish setup
sleep 30

# Set up router
ziti edge login -p ${ZITI_PWD}

ziti edge create edge-router "edge-router-test-1" --jwt-output-file "/tmp/edge-router-test-1.jwt" --tunneler-enabled

echo "ZITI_CTRL_ADVERTISED_ADDRESS='ctrl1.${TRUST_DOMAIN}'" > /opt/openziti/etc/router/bootstrap.env
echo "ZITI_CTRL_ADVERTISED_PORT='6400'" >> /opt/openziti/etc/router/bootstrap.env
echo "ZITI_ROUTER_ADVERTISED_ADDRESS='rout1.${TRUST_DOMAIN}'" >> /opt/openziti/etc/router/bootstrap.env
echo "ZITI_ROUTER_PORT='3999'" >> /opt/openziti/etc/router/bootstrap.env
echo "ZITI_ENROLL_TOKEN='/tmp/edge-router-test-1.jwt'" >> /opt/openziti/etc/router/bootstrap.env

/opt/openziti/etc/router/bootstrap.bash

systemctl enable --now ziti-router.service


# verify traffic 
ziti ops verify traffic -p ${ZITI_PWD} 

After 5 minutes of ziti-router.service uptime I disable the router.

ziti fabric update router edge-router-test-1 --disabled

I confirmed the disconnection, waited 60s, then re-enabled the router.

ziti fabric update router edge-router-test-1 --disabled=false

At this point the router certs have not been renewed.

This is when I start to see the router connection errors in the controller logs.

{"_context":"tls:0.0.0.0:6500","file":"github.com/openziti/channel/v3@v3.0.39/classic_listener.go:219","func":"github.com/openziti/channel/v3.(*classicListener).acceptConnection.func1","level":"error","msg":"connection handler error for [tls:127.0.0.1:57384] (x509: certificate signed by unknown authority)","time":"2025-04-14T12:03:34.040Z"}

FYI @plorenz The Linux router has default run params ziti router run config.yml --extend, and there was a recent bugfix for --extend because it was failing silently, resulting in the router's leaf certs not being renewed at startup. I'm unsure if the problem extended to the default renewal timer, but I suspect it did also prevent that from succeeding.

@farmhouse If you decide it's necessary, the Linux router's run params can be influenced to disable cert renewal at startup by setting ZITI_ARGS.

# default value
❯ grep ZITI_ARGS /opt/openziti/etc/router/service.env
ZITI_ARGS='--extend'
 
# startup renewal disabled
❯ grep ZITI_ARGS /opt/openziti/etc/router/service.env
ZITI_ARGS=''
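
If you'd rather script the change than edit the file by hand, something like the following sketch could work. `set_ziti_args` is a hypothetical helper name, not something shipped with the package; it assumes the env file holds a single ZITI_ARGS line in the form shown above and that the new value contains no `|` or `&` characters.

```shell
# Hypothetical helper: set ZITI_ARGS in a service env file, keeping a .bak copy.
# $1 = env file path, $2 = new value (may be empty)
set_ziti_args() {
    env_file=$1
    new_value=$2
    cp "$env_file" "$env_file.bak" || return 1
    if grep -q '^ZITI_ARGS=' "$env_file.bak"; then
        # Rewrite the existing line, reading from the backup copy.
        sed "s|^ZITI_ARGS=.*|ZITI_ARGS='${new_value}'|" "$env_file.bak" > "$env_file"
    else
        # No existing line, so append one.
        printf "ZITI_ARGS='%s'\n" "$new_value" >> "$env_file"
    fi
}

# Usage:
#   set_ziti_args /opt/openziti/etc/router/service.env ''
#   systemctl restart ziti-router.service
```

Passing '--extend' as the second argument puts the default back.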

@qrkourier ah, nice catch, thank you!

@farmhouse let us know if changing that startup flag resolves the issue. The fix should be in the next release.

Paul


Thanks @plorenz @qrkourier,

This has fixed my issue.