Openziti edge router online issue in an HA setup

1VeryNaughtyCat · May 7, 2025, 4:06am

Hello, thank you everyone for your support in this community.
Setup:
Three HA controller nodes:

ctrl1.ziti.home.lab:7878
ctrl2.ziti.home.lab:7878
ctrl3.ziti.home.lab:7878

The PKI was created with doc/ha/create-pki.sh.

All nodes are online and ctrl1 is the leader
╭───────┬──────────────────────────────┬───────┬────────┬─────────┬───────────╮
│ ID    │ ADDRESS                      │ VOTER │ LEADER │ VERSION │ CONNECTED │
├───────┼──────────────────────────────┼───────┼────────┼─────────┼───────────┤
│ ctrl1 │ tls:ctrl1.ziti.home.lab:7979 │ true  │ true   │ v1.5.4  │ true      │
│ ctrl2 │ tls:ctrl2.ziti.home.lab:7979 │ true  │ false  │ v1.5.4  │ true      │
│ ctrl3 │ tls:ctrl3.ziti.home.lab:7979 │ true  │ false  │ v1.5.4  │ true      │
╰───────┴──────────────────────────────┴───────┴────────┴─────────┴───────────╯
I have successfully been able to register and verify a router and bring it online.

However, if I restart the router, the router will not come back online. The endpoints.yml is populated with the controllers list above.

The errors that I am getting in the router console are as follows:

ziti-router-1 | {"endpoint":"tls:ctrl3.ziti.home.lab:1280","error":"error connecting ctrl (remote error: tls: internal error)","file":"github.com/openziti/ziti/router/env/ctrls.go:192","func":"github.com/openziti/ziti/router/env.(*networkControllers).connectToControllerWithBackoff.func2","level":"error","msg":"unable to connect controller","time":"2025-05-07T04:00:02.542Z"}
ziti-router-1 | {"endpoint":"tls:ctrl2.ziti.home.lab:1280","error":"error connecting ctrl (remote error: tls: internal error)","file":"github.com/openziti/ziti/router/env/ctrls.go:192","func":"github.com/openziti/ziti/router/env.(*networkControllers).connectToControllerWithBackoff.func2","level":"error","msg":"unable to connect controller","time":"2025-05-07T04:00:02.667Z"}
ziti-router-1 | {"endpoint":"tls:ctrl2.ziti.home.lab:1280","error":"error connecting ctrl (remote error: tls: internal error)","file":"github.com/openziti/ziti/router/env/ctrls.go:192","func":"github.com/openziti/ziti/router/env.(*networkControllers).connectToControllerWithBackoff.func2","level":"error","msg":"unable to connect controller","time":"2025-05-07T04:00:03.776Z"}
ziti-router-1 | {"endpoint":"tls:ctrl1.ziti.home.lab:1280","error":"error connecting ctrl (remote error: tls: internal error)","file":"github.com/openziti/ziti/router/env/ctrls.go:192","func":"github.com/openziti/ziti/router/env.(*networkControllers).connectToControllerWithBackoff.func2","level":"error","msg":"unable to connect controller","time":"2025-05-07T04:00:03.919Z"}

and errors in the controller console:
[4750.475] ERROR channel/v3.(*classicListener).acceptConnection.func1 [tls:0.0.0.0:7979]: connection handler error for [tls:192.168.1.225:41858] (x509: certificate signed by unknown authority)
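
To see at a glance which endpoints the router is failing to dial, the JSON log lines above can be reduced to a count per endpoint. This is a minimal sketch using only standard shell tools; the function name count_failed_endpoints and the file name router.log are made up for illustration. In the excerpt above every failing endpoint uses port 1280, while the cluster table lists the controllers on 7979, so comparing the two may be worthwhile.

```shell
# Count "unable to connect controller" errors per endpoint.
# Reads router JSON log lines on stdin, prints "<count> <endpoint>" lines.
count_failed_endpoints() {
    grep '"msg":"unable to connect controller"' |
        sed -n 's/.*"endpoint":"\([^"]*\)".*/\1/p' |
        sort | uniq -c
}

# Usage (router.log is a hypothetical file holding the router's console output):
#   count_failed_endpoints < router.log
```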

Any help would be great.

Cheers

dmuensterer · May 7, 2025, 8:13am

I stumbled over this issue as well. However I'm surprised that your controllers were able to "clusterize".
In my opinion there is an issue in the Controller Certificates page:
The commands create client and server certs/keys, however they are missing an important step: Server and client cert need to have the same private key.

So instead of

ziti pki create server --pki-root ./pki --ca-name ctrl1 --dns "localhost,ctrl1.ziti.example.com" --ip "127.0.0.1,::1" --server-name ctrl1 --spiffe-id 'controller/ctrl1'
ziti pki create client --pki-root ./pki --ca-name ctrl1 --client-name ctrl1 --spiffe-id 'controller/ctrl1'

it should much rather be

ziti pki create server --pki-root ./pki --ca-name ctrl1 --dns "localhost,ctrl1.ziti.example.com" --ip "127.0.0.1,::1" --server-name ctrl1 --spiffe-id 'controller/ctrl1'
ziti pki create client --pki-root ./pki --ca-name ctrl1 --client-name ctrl1 --spiffe-id 'controller/ctrl1' --key-file server # <-- notice the --key-file server

By default, the ziti pki create server command creates a private key called server.key, so when creating the client certificate you need to specify the same key that was used for the server cert.
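
Whether or not a shared key turns out to matter, it is easy to check if two certificates were issued for the same key pair by comparing their embedded public keys. A minimal sketch with openssl follows; the function name same_pubkey is made up, and the file paths in the usage comment are an assumption about where ziti pki places its output, so adjust them to your PKI layout.

```shell
# Compare the public keys embedded in two PEM certificates.
# Returns 0 if they match, 1 if they differ, 2 if a certificate cannot be read.
same_pubkey() {
    pub_a=$(openssl x509 -in "$1" -noout -pubkey) || return 2
    pub_b=$(openssl x509 -in "$2" -noout -pubkey) || return 2
    [ "$pub_a" = "$pub_b" ]
}

# Usage (paths are assumed, not confirmed by the thread):
#   same_pubkey ./pki/ctrl1/certs/server.cert ./pki/ctrl1/certs/client.cert \
#       && echo "same key" || echo "different keys"
```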

@TheLumberjack Please correct me if I'm wrong, however this was my observation.

TheLumberjack · May 7, 2025, 12:03pm

It's cleaner imo to have one key, but I wouldn't think it would matter in practice. What should matter is that the claims present in the cert offered on connect are correct, and that the cert is signed by an expected CA. I've not personally hit this sort of problem, where restarting the router causes this issue.
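
The "signed by an expected CA" part can be spot-checked outside of ziti with openssl verify. The sketch below is an illustration only: the function name signed_by_ca and the file names in the usage comment are hypothetical. OpenSSL's chain-building rules are also not identical to those of the Go TLS stack the controller uses, so treat the result as a hint.

```shell
# Check that a PEM certificate chains to one of the CAs in a bundle file.
# Returns 0 when openssl can verify the certificate, non-zero otherwise.
# Only the first certificate in the second file is checked; any intermediates
# must be present in the bundle for verification to succeed.
signed_by_ca() {
    ca_bundle=$1
    cert=$2
    openssl verify -CAfile "$ca_bundle" "$cert" >/dev/null 2>&1
}

# Usage (file names are placeholders):
#   signed_by_ca ctrl-ca-bundle.pem router.cert && echo "trusted" || echo "NOT trusted"
```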

@1VeryNaughtyCat are you perhaps using a 'deployment' based distribution? It might be relevant.

1VeryNaughtyCat · May 7, 2025, 10:02pm

Hello, thank you @TheLumberjack and @dmuensterer for responding.

I have recreated the PKI chain and brought the controllers online.
I have enrolled a new edge router. The same thing happened: it was verified and online, but after restarting the router it showed the same issue and was not online.

However, I read through all the router logs slowly and saw that it was indeed online and was talking to ctrl03.

Opening up ZAC on ctrl03 (not the cluster leader at that time), the router is displayed as online. However, the status is not displayed on the other two ZACs.

I will try using the CLI to see if it is a ZAC thing.

I have tested creating identities when the cluster is in differing states and they are happily replicated.
I have also tested taking down CTRL03, and the router will not connect to the remaining cluster nodes.
I restarted CTRL03 and the router came online. I am not sure why the data is not being replicated to the other nodes so that ZAC can display it.

I have also noticed an interesting thing with the cluster status. If you connect to the cluster leader and query the status of the cluster, you see all nodes connected, and their status. However, if you connect to either of the other slave nodes and query, you see that the leader and the node itself are connected, but the other node is "Not connected". Restarting the leader does see the leader shift to another voting node.

Not sure if this is usual behaviour for the cluster.

@TheLumberjack, sorry I don't understand the question. I am using Ubuntu 24.01 server and docker. Is that what you are asking?

Thank you for your help.

Cheers

TheLumberjack · May 8, 2025, 1:25pm

That is interesting. I wouldn't expect that, myself.

Yes that is what I was asking. Are you using docker compose or just singular docker commands to start your network?

If we ignore the odd reporting of the state of the overlay when querying different controllers, is there any malfunction? Or is the problem right now just the inconsistencies of the different controllers? Can you level set me on what the 'problem' is currently? I feel like I've lost the thread.

plorenz · May 8, 2025, 2:08pm

This is expected. The controllers in the cluster only need a connection to the leader. This means you will often see that followers are not connected to each other.

It might be less confusing if we showed 'is connected to leader' instead of 'is connected', but unfortunately we don't always know the state of other followers from a follower.

Paul

1VeryNaughtyCat · May 8, 2025, 11:48pm

Hello, thank you.
@plorenz, thank you for clarifying the cluster status.
@TheLumberjack, sorry.
I am using the following:

The controllers are in a HA cluster and there is quorum.
Docker and Compose for both controllers and routers.
I am using a hand-crafted controller.conf instead of the bootstrap version. That said, I did craft the controller configs from the HA documentation in your GitHub, using Ansible to modify them for deployment.
The cluster seems stable and I have tested multiple failure states and the cluster comes back online.

The router issue is as follows:

You can create a new router, verify the identity, and bring the router online. However, if you reboot the router, it will come online again, but you then need to find which controller the router has registered with. This is not guaranteed to be the leader of the cluster.
Only the ZAC of the controller the router registered with will show the status of the router.
The CLI tooling will only show the status of the router on the registered controller; the other controllers say that the router is not online.
Controller data is not updating the other clustered nodes with the status of the router.
If you shut down the controller that the router has registered with, you lose that router, as it will not connect to any other controller in the cluster, even if you reboot the router.
The router is aware of all the nodes in the cluster.

Thank you for your time.

Cheers

1VeryNaughtyCat · May 12, 2025, 3:44am

Hello, thank you for your support.
I have decided to go down another route and abandon using OpenZiti HA for the near future.
I have successfully created a failover environment using a k3s cluster and Longhorn for storage replication.

Working great so far.

Cheers

farmhouse · May 19, 2025, 9:04am

The symptoms here appear to be the same as my issue; the resolution is in this post.
