Adding new router to ziti network fails

Hi folks,

I am scratching my head here and hope that you can give me a push in the right direction.

I have a newly set up ziti network, right now consisting only of a router and a controller. I have been using the helm charts to install everything. All my stuff is on version 0.27.5. Control plane, Client API and my router’s edge API are all available publicly, using Ingress objects and proper TLS passthrough on the ingress controller. I can successfully dial into the openziti network with my laptop using the desktop edge and communicate with services inside the network.

I am now failing to add a simple additional private router. I did simply use the cli on the controller to issue ziti edge create edge-router secadm-int-router -o /tmp/my-private-router.jwt -t --no-traversal, then copy this jwt file to my laptop and issue the installation of the router in another Kubernetes cluster like this: helm install private-router -f private-router_values.yaml --set-file enrollmentJwt=my-private-router.jwt openziti/ziti-router.

My ziti-router.yml which is created by the helm chart looks like this:

v: 3
  cert:        ${ZITI_ROUTER_IDENTITY_DIR}/client.crt
  server_cert: ${ZITI_ROUTER_IDENTITY_DIR}/tls.crt
  key:         ${ZITI_ROUTER_IDENTITY_DIR}/tls.key
  ca:          ${ZITI_ROUTER_IDENTITY_DIR}/ca.crt


    - binding: transport
  - binding: edge
    address: tls:
        connectTimeoutMs: 1000
        getSessionTimeout: 60
  - binding: tunnel
        mode: host
                - localhost
    latencyProbeInterval: 10
    xgressDialQueueLength: 1000
    xgressDialWorkerCount: 128
    linkDialQueueLength: 1000
    linkDialWorkerCount: 32

The router starts up and at first glance it looks like it’s working. In Ziti Console I see two green dots in front of this new router’s identity.

BUT: I see this when checking the fabric links from the controller:

ziti fabric list links
│ ID                     │ DIALER            │ ACCEPTOR    │ STATIC COST │ SRC LATENCY │ DST LATENCY │ STATE  │ STATUS │ FULL COST │
│ 2a1pwELa3rbUQ1y6zpZwsb │ private-router    │ core-router │           1 │   65000.0ms │   65000.0ms │ Failed │     up │    130001 │
results: 1-1 of 1

The private router logs this every minute:

[5048.000]    INFO fabric/router/handler_ctrl.(*dialHandler).handle |link, linkDialer|: {routerVersion=[v0.27.5] linkId=[4b5Td6QI1TujofphEhhhyC] routerId=[Ff8oRvyqtj] address=[] linkProtocol=[tls]} dialing link
[5048.068]    INFO fabric/router/handler_link.(*bindHandler).BindChannel: {routerId=[Ff8oRvyqtj] routerVersion=[v0.27.5] linkId=[4b5Td6QI1TujofphEhhhyC]} link destination support heartbeats
[5048.068]    INFO fabric/router/handler_link.(*closeHandler).HandleClose [ch{l/4b5Td6QI1TujofphEhhhyC}->u{classic}->i{a2EE}]: {linkId=[4b5Td6QI1TujofphEhhhyC] routerId=[Ff8oRvyqtj]} link closed
[5048.130]    INFO fabric/router/handler_link.(*bindHandler).BindChannel: {linkId=[4b5Td6QI1TujofphEhhhyC] routerId=[Ff8oRvyqtj] routerVersion=[v0.27.5]} link destination support heartbeats
[5048.130]    INFO fabric/router.(*xlinkAccepter).Accept: accepted new link [l/4b5Td6QI1TujofphEhhhyC]
[5048.130]    INFO fabric/router.(*linkRegistryImpl).applyLink: {linkProtocol=[tls] newLinkId=[4b5Td6QI1TujofphEhhhyC] dest=[Ff8oRvyqtj]} link being registered, but is already closed, skipping registration
[5048.130]    INFO fabric/router/handler_link.(*closeHandler).HandleClose [ch{l/4b5Td6QI1TujofphEhhhyC}->u{classic}->i{XppP}]: {routerId=[Ff8oRvyqtj] linkId=[4b5Td6QI1TujofphEhhhyC]} link closed

And the other router, which should receive this connection says that:

[5491.461]   ERROR edge/router/xgress_edge.(*sessionConnectionHandler).HandleClose: {id=[4b5Td6QI1TujofphEhhhyC]} session connection handler encountered a HandleClose that did not have a SessionTokenHeader
[5491.461]   ERROR channel/v2.AcceptNextChannel.func1: {error=[no token attribute provided]} failure accepting channel edge with underlay u{classic}->i{a2EE}
[5491.524]   ERROR edge/router/xgress_edge.(*sessionConnectionHandler).HandleClose: {id=[4b5Td6QI1TujofphEhhhyC]} session connection handler encountered a HandleClose that did not have a SessionTokenHeader
[5491.524]   ERROR channel/v2.AcceptNextChannel.func1: {error=[no token attribute provided]} failure accepting channel edge with underlay u{classic}->i{XppP}

I don’t see what I might have done wrong, or what could be different from other setups, except that my controller and “core-router” are made public via an Ingress and ingress controller, in which I cannot yet see any error.

Do these log messages give you any hints?

Thanks a lot in advance.


Thanks for all the details you provided. It’s helpful. Can you also show the “link.listeners” section of the “public” edge router, the one that should have an advertised address? I’d like to verify the certificate presented is valid. You’ve probably done that, but I figured I’d check as well, and verify that the TLS passthrough is indeed proper. :slight_smile: It seems like you probably know this but it doesn’t hurt to check even though I know your edge clients can connect to the edge listener.

From what you have described, that seems like it’s the only possible problem or it’s the TLS passthrough. If it were me, I would start by doing this process over, but I would take the “private-side” kubernetes automation out of the equation and I would simply start a router up on your local laptop (or wherever you want) and verify that it starts up properly outside the “private” kubernetes automation. At least then we’ll know on which side the problem is: on the public kubernetes cluster side, or somehow on the private cluster side.

Does that make sense and seem like a sensible test to you?

Hi @TheLumberjack!

Yes, it definitely makes sense to dig more into the ingress and certificate passthrough stuff, and yes, setting up a router on my laptop is another thing that I had in mind, but I discarded that for now because I don’t see how the problem could be in the orchestration of starting up the router inside k8s.

Here’s the requested part of the public router’s config. Domain names are changed for privacy, but (at least I hope) I do this consistently over the entire post:

    - binding: transport
    - binding:          transport
      bind:             tls:
        outQueueSize:   4

The Ingress object:

kind: Ingress
  name: ziti-core-router-edge
  namespace: openziti
  ingressClassName: nginx
  - host:
      - backend:
            name: ziti-core-router-edge
              number: 443
        path: /
        pathType: Prefix
    - ip:
    - ip:
    - ip:

I did two more tests: I added a debug sidecar to the internal private router, issued a curl, and checked the logs of the public router, just to make sure the request really ends up there. Yes it does! Of course, I get ugly SSL errors, but that should be fine and is expected:

[10898.692]   ERROR channel/v2.(*classicListener).acceptConnection.func1 [tls:]: error receiving hello from [tls:] (receive error (local error: tls: bad record MAC))

So I did a connection test with openssl to see what certificate the public router spits out:

bash-5.1# openssl s_client -showcerts -connect
depth=2 CN = ziti-controller-edge-root
verify error:num=19:self signed certificate in certificate chain
verify return:1
depth=2 CN = ziti-controller-edge-root
verify return:1
depth=1 CN = ziti-controller-edge-signer
verify return:1
depth=0 C = , ST = , L = , O = , OU = , CN = Ff8oRvyqtj
verify return:1
Certificate chain
 0 s:C = , ST = , L = , O = , OU = , CN = Ff8oRvyqtj
   i:CN = ziti-controller-edge-signer
 1 s:CN = ziti-controller-edge-signer
   i:CN = ziti-controller-edge-root
 2 s:CN = ziti-controller-edge-root
   i:CN = ziti-controller-edge-root
Server certificate
subject=C = , ST = , L = , O = , OU = , CN = Ff8oRvyqtj

issuer=CN = ziti-controller-edge-signer

No client certificate CA names sent
Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1
Shared Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
SSL handshake has read 2657 bytes and written 427 bytes
Verification error: self signed certificate in certificate chain
New, TLSv1.3, Cipher is TLS_AES_128_GCM_SHA256
Server public key is 4096 bit
Secure Renegotiation IS NOT supported
No ALPN negotiated
Early data was not sent
Verify return code: 19 (self signed certificate in certificate chain)
140539129076552:error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate:ssl/record/rec_layer_s3.c:1543:SSL alert number 42

Doesn’t look all bad, huh?

Also please note: I am successfully connected to the network with a ziti desktop edge and I am actually able to consume services. …

The openssl command is what I wanted to run. If you want to DM me the actual name of the public router you can.

I wanted to see the X509v3 Subject Alternative Name from the final leaf cert and verify that it’s “”. I usually use this command for that:

openssl s_client -connect | openssl x509 -text | grep Alter -a1

You’ll see it returns:

            X509v3 Subject Alternative Name:

I expect that the first replacement you did is not actually different than, right?

I don’t see how the problem could be in the orchestration of starting up the router inside k8s

I think it’s worthwhile to still go through with connecting a manually provisioned router. I know you can connect your edge client, but the PKI that gets used for fabric links is not necessarily the same as the PKI used for edge connections, so it is possibly different.

… so I wrote all that, then looked at your results one more time, and I think maybe it DOES look wrong?

I looked at openssl s_client -showcerts -connect one more time and I noticed that the chain returned sure looks like it is for the CONTROLLER and not the router, right?

That openssl s_client should be returning a chain that is from the router, I think? The SANs would show us that for sure though, so can you run that openssl command and get the X509v3 Subject Alternative Name section?
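If you want to compare the SANs of several routers without eyeballing the grep output, the text printed by `openssl x509 -text` can be parsed. This is only a minimal sketch: `parse_sans` is a made-up helper name, the sample hostnames are invented, and it assumes the SAN entries sit on the single line directly below the "X509v3 Subject Alternative Name" header, as in the output shown in this thread.

```python
def parse_sans(x509_text):
    """Collect DNS and IP SAN entries from `openssl x509 -text` output."""
    sans = {"DNS": [], "IP Address": []}
    lines = x509_text.splitlines()
    for i, line in enumerate(lines):
        if "X509v3 Subject Alternative Name" in line and i + 1 < len(lines):
            # Entries look like "DNS:name, IP Address:1.2.3.4".
            for entry in lines[i + 1].split(","):
                kind, sep, value = entry.strip().partition(":")
                # Skip empty or unknown entries (e.g. redacted names).
                if sep and kind in sans and value:
                    sans[kind].append(value)
            break
    return sans


# Hypothetical sample; real hostnames and addresses will differ.
sample = """\
            X509v3 Subject Alternative Name:
                DNS:localhost, DNS:router.example.com, IP Address:127.0.0.1
"""
print(parse_sans(sample))
```

Feeding it the saved output of the openssl pipeline for each router makes it easy to see which names a given listener actually answers for.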

Hi and thanks for digging deep with me :slight_smile:

This is a nice way to see whether the right component answers, also, yes!

openssl s_client -connect | openssl x509 -text | grep Alter -a1

gives me:

            X509v3 Subject Alternative Name:
                DNS:localhost,, IP Address:

Again, domain name changed, but it’s answering with the right one, I can confirm!

It actually is a different name, only accessible from internal network. This is the config of the private router, which I would like to connect edges to later on (maybe) but it should not be connected from other routers to form the router fabric, as it is not possible to connect from the public router to this private one.

OK, I will very probably try this tomorrow (today. :))…

This is strange, as the openssl output above showed the right name. … but yes, it says “controller” there a lot of times. … Maybe we DO have a problem with the helm chart here, not with my private router but with the public one. … but still: why can I consume services using my desktop edge then …?

Yeah everything seems like it checks out properly. It definitely makes me think that the “public” cluster is indeed set up correctly. The steps you outline are exactly what I would do. The one small thing I personally don’t do is use --no-traversal, so it’s possible that’s having an effect that isn’t obvious to me. I’d be interested in your “non-kubernetes” test and, if that has the same results, maybe removing the --no-traversal flag just to test if that’s somehow causing an issue.

Keeping it out of kubernetes just reduces some possible areas of complexity… For example when the router enrolls, it needs to write the PKI to the locations specified in the topmost identity section of the config. If one or more of those locations don’t persist properly, it could cause problems. Doing it all locally would just eliminate a few of those types of variables.

Once you connect an edge router without going through kubernetes automation, that’ll give us more information to go by.

What you have done seems like it should be ok. That’s why I’m asking the “dumb” type questions, since it seems to me like you did it right…

Thanks, let me know how the “local” router install goes.

Good morning!

I have done a quick setup of a router on my notebook. I have used version 0.27.5 as this is the version I am using in Kubernetes (pull request to upgrade to 0.27.7 is still open), and then I tried with 0.27.7. Result is the same on both.

Long story short: same problem:

./ziti-router run router.yaml
[   0.022]    INFO ziti/ziti/ {revision=[3d9801e73809] go-version=[go1.19.5] os=[darwin] configFile=[router.yaml] build-date=[2023-02-13T21:41:19Z] arch=[amd64] routerId=[55Ls.Wy4PX] version=[v0.27.5]} starting ziti-router
[   0.025]    INFO fabric/router/forwarder.(*Scanner).run: started
[   0.025]    INFO fabric/router/forwarder.(*Faulter).run: started
[   0.027]    INFO fabric/metrics.GoroutinesPoolMetricsConfigF.func1.1: {maxQueueSize=[1000] minWorkers=[0] maxWorkers=[32] idleTime=[30s] poolType=[]} starting goroutine pool
[   0.027]    INFO fabric/metrics.GoroutinesPoolMetricsConfigF.func1.1: {idleTime=[30s] maxQueueSize=[1000] minWorkers=[0] maxWorkers=[128] poolType=[pool.route.handler]} starting goroutine pool
[   0.028] WARNING edge/router/internal/edgerouter.(*Config).LoadConfigFromMap: Invalid heartbeat interval [0] (min: 60, max: 10), setting to default [60]
[   0.030] WARNING edge/router/internal/edgerouter.parseEdgeListenerOptions: port in [listeners[0].options.advertise] must equal port in [listeners[0].address] for edge binding but did not. Got [443] [3022]
[   0.033]    INFO fabric/router.(*Router).initializeCtrlEndpoints: controller endpoints file [endpoints] doesn't exist. Using initial endpoints from config
[   0.034]    INFO fabric/router.(*Router).showOptions: ctrl = {"OutQueueSize":4,"MaxQueuedConnects":1,"MaxOutstandingConnects":16,"ConnectTimeout":1000000000,"DelayRxStart":false,"WriteTimeout":0}
[   0.034]    INFO fabric/router.(*Router).showOptions: metrics = {"ReportInterval":60000000000,"MessageQueueSize":10}
[   0.034]    INFO fabric/router.(*Router).initializeHealthChecks: starting health check with ctrl ping initially after 15s, then every 30s, timing out after 15s
[   0.035]    INFO fabric/router.(*Router).startXlinkDialers: started Xlink dialer with binding [transport]
[   0.037]    INFO edge/router/xgress_edge.(*listener).Listen: {address=[tls:]} starting channel listener
[   0.037]    INFO fabric/metrics.GoroutinesPoolMetricsConfigF.func1.1: {poolType=[pool.listener.xgress_edge] idleTime=[10s] maxQueueSize=[1] minWorkers=[1] maxWorkers=[16]} starting goroutine pool
[   0.039]    INFO fabric/router.(*Router).startXgressListeners: created xgress listener [edge] at [tls:]
[   0.039]    INFO edge/router/xgress_edge.(*Acceptor).Run: starting
[   0.039]    INFO fabric/router.(*Router).startXgressListeners: created xgress listener [tunnel] at []
[   0.040]    INFO fabric/router.(*Router).startControlPlane: router configured with 1 controller endpoints
[   0.040]    INFO fabric/router.(*Router).startControlPlane: connecting to controller at endpoing []
[   0.189]    INFO edge/router/fabric.(*StateManagerImpl).StartHeartbeat: heartbeat starting
[   0.190]    INFO edge/router/xgress_edge_tunnel.(*tunneler).Start: {mode=[host]} creating interceptor
[   0.190]    INFO edge/router/xgress_edge.(*CertExpirationChecker).Run: waiting 8615h59m16.577852s to renew certificates
[   0.195]    INFO edge/router/handler_edge_ctrl.(*helloHandler).HandleReceive.func1: received server hello, replying
[   0.198] WARNING edge/tunnel/dns.flushDnsCaches: {error=[exec: "resolvectl": executable file not found in $PATH]} unable to find systemd-resolve or resolvectl in path, consider adding a dns flush to your restart process
[   0.229]    INFO edge/router/handler_edge_ctrl.(*apiSessionAddedHandler).instantSync: {strategy=[instant]} first api session syncId [clfz6n2qjfu0e018q2c9pfinf], starting
[   0.230]    INFO edge/router/handler_edge_ctrl.(*apiSessionSyncTracker).Add: received api session sync chunk 0, isLast=true
[   0.586]    INFO fabric/router/handler_ctrl.(*dialHandler).handle |link, linkDialer|: {routerId=[Ff8oRvyqtj] address=[] linkProtocol=[tls] routerVersion=[v0.27.5] linkId=[7LnlaVm2x8RcecYJumdNWx]} dialing link
[   0.838]    INFO fabric/router/handler_link.(*bindHandler).BindChannel: {routerVersion=[v0.27.5] linkId=[7LnlaVm2x8RcecYJumdNWx] routerId=[Ff8oRvyqtj]} link destination support heartbeats
[   0.838]    INFO fabric/router/handler_link.(*closeHandler).HandleClose [ch{l/7LnlaVm2x8RcecYJumdNWx}->u{classic}->i{rNmM}]: {routerId=[Ff8oRvyqtj] linkId=[7LnlaVm2x8RcecYJumdNWx]} link closed
[   1.223]    INFO fabric/router/handler_link.(*bindHandler).BindChannel: {linkId=[7LnlaVm2x8RcecYJumdNWx] routerId=[Ff8oRvyqtj] routerVersion=[v0.27.5]} link destination support heartbeats
[   1.224]    INFO fabric/router.(*xlinkAccepter).Accept: accepted new link [l/7LnlaVm2x8RcecYJumdNWx]
[   1.224]    INFO fabric/router.(*linkRegistryImpl).applyLink: {linkProtocol=[tls] newLinkId=[7LnlaVm2x8RcecYJumdNWx] dest=[Ff8oRvyqtj]} link being registered, but is already closed, skipping registration
[   1.224]    INFO fabric/router/handler_link.(*closeHandler).HandleClose [ch{l/7LnlaVm2x8RcecYJumdNWx}->u{classic}->i{XzKQ}]: {linkId=[7LnlaVm2x8RcecYJumdNWx] routerId=[Ff8oRvyqtj]} link closed
[   1.231]    INFO edge/router/handler_edge_ctrl.(*apiSessionAddedHandler).applySync: finished sychronizing api sessions [count: 5, syncId: clfz6n2qjfu0e018q2c9pfinf, duration: 201.709µs]
[   2.028]    INFO edge/tunnel/intercept.SetDnsInterceptIpRange: dns intercept IP range: -

[  60.865]    INFO fabric/router/handler_ctrl.(*dialHandler).handle |link, linkDialer|: {address=[] linkProtocol=[tls] routerVersion=[v0.27.5] linkId=[70VYEyDxMla3SExcN8kBCF] routerId=[Ff8oRvyqtj]} dialing link
[  60.981]    INFO fabric/router/handler_link.(*bindHandler).BindChannel: {linkId=[70VYEyDxMla3SExcN8kBCF] routerId=[Ff8oRvyqtj] routerVersion=[v0.27.5]} link destination support heartbeats
[  60.982]    INFO fabric/router/handler_link.(*closeHandler).HandleClose [ch{l/70VYEyDxMla3SExcN8kBCF}->u{classic}->i{dnR4}]: {routerId=[Ff8oRvyqtj] linkId=[70VYEyDxMla3SExcN8kBCF]} link closed
[  61.093]    INFO fabric/router/handler_link.(*bindHandler).BindChannel: {linkId=[70VYEyDxMla3SExcN8kBCF] routerId=[Ff8oRvyqtj] routerVersion=[v0.27.5]} link destination support heartbeats
[  61.093]    INFO fabric/router.(*xlinkAccepter).Accept: accepted new link [l/70VYEyDxMla3SExcN8kBCF]
[  61.093]    INFO fabric/router.(*linkRegistryImpl).applyLink: {linkProtocol=[tls] newLinkId=[70VYEyDxMla3SExcN8kBCF] dest=[Ff8oRvyqtj]} link being registered, but is already closed, skipping registration
[  61.094]    INFO fabric/router/handler_link.(*closeHandler).HandleClose [ch{l/70VYEyDxMla

There MUST be something wrong with the certificate chain on the public router, something wrong with the way the helm chart builds that… I still wonder why another ROUTER can’t connect to my public router but a desktop edge client can. … shouldn’t the same mechanism be behind both?

I did another test with my still running ziti setup where I have published all necessary ports to the outside without an Ingress Controller: I use “” with different ports and port forwarding there; 1290 would be the edge-router port.

I see two differences. One is that the SANs include the term “core-router”. This could be because of the different way the helm charts do the setup (we set up this installation together with @marvkis using his helm charts, which have been merged into the original ones but with changes), but it is worth mentioning anyway:

~/tmp/ziti-router-test/ openssl s_client -connect | openssl x509 -text | grep Alter -a1
depth=2 CN = ziti-signing-root-ca
verify error:num=19:self signed certificate in certificate chain
verify return:0

            X509v3 Subject Alternative Name:
                DNS:localhost,, DNS:core-router, IP Address:
8268038464:error:1404C412:SSL routines:ST_OK:sslv3 alert bad certificate:/AppleInternal/Library/BuildRoots/9e200cfa-7d96-11ed-886f-a23c4f261b56/Library/Caches/ alert number 42

…and if I take the latter part out of that command to see the certificate chain, I clearly see that it looks different from the one the new public router presents:

~/tmp/ziti-router-test/ openssl s_client -connect
depth=2 CN = ziti-signing-root-ca
verify error:num=19:self signed certificate in certificate chain
verify return:0
write W BLOCK
Certificate chain
 0 s:/C=/ST=/L=/O=/OU=/CN=0Sa9Z5iuj
 1 s:/CN=ziti-signing-intermediate-ca
 2 s:/CN=ziti-signing-root-ca
Server certificate
No client certificate CA names sent
Server Temp Key: ECDH, X25519, 253 bits
SSL handshake has read 2647 bytes and written 381 bytes
New, TLSv1/SSLv3, Cipher is AEAD-CHACHA20-POLY1305-SHA256
Server public key is 4096 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
    Protocol  : TLSv1.3
    Cipher    : AEAD-CHACHA20-POLY1305-SHA256
    Start Time: 1680427398
    Timeout   : 7200 (sec)
    Verify return code: 19 (self signed certificate in certificate chain)

I will have a deeper look later.

I have dug a bit into the public core router setup. I noticed that 8 days ago the “post-install-job” apparently failed. Unfortunately, I don’t see what exactly went wrong, because the pod created from this job is already gone. This post-install-job runs the following script:

#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
set -o xtrace

if kubectl -n openziti get secret \
  ziti-core-router-identity &>/dev/null; then
  if [[ ${HELM_UPGRADE:-} == true ]]; then
    echo 'INFO: no-op because secret exists and is Helm upgrade'
    # exit without error so Helm will delete the post-upgrade hook Job
    exit 0
  else
    echo 'ERROR: secret exists: "ziti-core-router-identity"' >&2
    # this should never happen because Helm deletes the secret with pre-uninstall hook
    exit 1
  fi
else
  echo "INFO: identity secret does not exist, attempting router enrollment"
fi


ziti router enroll \
  /etc/ziti/config/ziti-router.yaml \
  --jwt /etc/ziti/config/enrollment.jwt \

kubectl -n openziti create secret generic \
  ziti-core-router-identity \
  --from-file=client.crt="${ZITI_ROUTER_IDENTITY_DIR}/client.crt" \
  --from-file=tls.key="${ZITI_ROUTER_IDENTITY_DIR}/tls.key" \
  --from-file=tls.crt="${ZITI_ROUTER_IDENTITY_DIR}/tls.crt" \

So it MAY be that something went wrong with the initial installation of the public router, causing it to deliver the wrong certificate structure somehow?

BUT, the router works and communicates fine with the controller, and as said, edge tunnelers work fine also.

The certificates all exist. But I am not 100% sure which of them is used for what, exactly. I could dig deeper into these with openssl, but I lack the knowledge of what to expect.

The failed job could, on the other hand, have happened for any other reason. The fact that I can no longer see the corresponding pod and its logs makes that a dead end for debugging.

It’s a bit complicated. They aren’t exactly the same mechanism. Routers connecting to one another are related to the overlay network itself. The clients connecting is “edge” functionality, and while both are the same insofar as they use mTLS, they are different PKI chains.

The last step of the script is the one that has me wondering. When a router enrolls, those four files are written, starting with the private key (tls.key). The CA is fetched from the controller (ca.crt), and two certificates are then made: one for client authentication and one for server auth.

It’s definitely difficult to troubleshoot this. I’m wondering if the router pod stopped/was deleted perhaps? Maybe there’s a bug around that side of things. I’ll have a chat with @qrkourier tomorrow as he’s a bit closer to kubernetes than I am, and I think I’ll have to talk to someone about the edge router PKI too, so I don’t think we’ll get too far today.

Your link listener is advertising the same port as your edge listener. So the link dial is reaching your edge listener instead of the link listener.
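For anyone who wants to check their own config for this, here is a rough sketch that looks for listeners advertising the same host:port. It works on a router config already loaded into a dict (for example with a YAML parser). The key names `link.listeners[].advertise` and `listeners[].options.advertise` are my assumption about the router config layout, `advertise_collisions` is a made-up helper name, and the hostnames are invented.

```python
def advertise_collisions(router_config):
    """Map each advertised host:port to its listeners, keeping only duplicates."""

    def normalize(address):
        # Assumption: link listeners advertise as "tls:host:port",
        # edge listeners as plain "host:port".
        address = address.strip().lower()
        return address[4:] if address.startswith("tls:") else address

    owners = {}
    link = router_config.get("link") or {}
    for listener in link.get("listeners") or []:
        if listener.get("advertise"):
            key = normalize(listener["advertise"])
            owners.setdefault(key, []).append("link/" + str(listener.get("binding")))
    for listener in router_config.get("listeners") or []:
        advertise = (listener.get("options") or {}).get("advertise")
        if advertise:
            key = normalize(advertise)
            owners.setdefault(key, []).append(str(listener.get("binding")))
    return {addr: names for addr, names in owners.items() if len(names) > 1}


# Hypothetical config shaped like the failing setup (hostnames invented).
config = {
    "link": {"listeners": [
        {"binding": "transport", "advertise": "tls:router.example.com:443"},
    ]},
    "listeners": [
        {"binding": "edge", "options": {"advertise": "router.example.com:443"}},
        {"binding": "tunnel", "options": {"mode": "host"}},
    ],
}
print(advertise_collisions(config))
```

A non-empty result means that, behind a TLS passthrough ingress, link dials and edge connections would be sent to the same backend, which matches the symptom in this thread: edge clients work while links fail.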


Woah eagle eyes! Yeah that’d definitely explain why edge clients are working while links are failing. Nice catch Paul

Thanks @plorenz and @TheLumberjack. I believe that is exactly where the mistake is. Today I had @marvkis, with whom I did the initial setup, look over this thread, and he said basically the same thing.

The culprit was this part in the public router’s config

    - binding: transport
    - binding:          transport
      bind:             tls:
        outQueueSize:   4

which originated from the fact that we simply did not set the linkListeners.transport part of the values.yaml. The chart then seems to fall back to using the “global” advertisedHost for both edge and transport. Kinda makes sense, but not here :slight_smile:

With this in values controlling the helm-chart, another nice Ingress is being created and my test router tries to connect to the right URL:

    advertisedPort: 443
      enabled: true
      ingressClassName: nginx
      annotations: "false" "true" "true"

Unfortunately, the private router still doesn’t connect, but this time for a good reason:

[ 482.038]   ERROR fabric/router/handler_ctrl.(*dialHandler).handle |link, linkDialer|: {error=[error dialing outgoing link [l/3MbRdV8odRie7RMdIYLbec]: error dialing payload channel for [l/3MbRdV8odRie7RMdIYLbec]: tls: failed to verify certificate: x509: certificate is valid for localhost,, not] linkId=[3MbRdV8odRie7RMdIYLbec] routerId=[Ff8oRvyqtj] address=[] linkProtocol=[tls] routerVersion=[v0.27.5]} link dialing failed

Sure: the certificate that was auto-generated when the public router was created with the wrong host name still contains the wrong host name. So, my next step would be to redeploy the public router and then see what happens.
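To illustrate why the dial is rejected: the dialing router checks the host name it dials against the names in the presented certificate. The sketch below is a simplified stand-in for that check, not the actual Go x509 verification. `cert_covers_host` is a made-up helper and the hostnames are invented.

```python
def cert_covers_host(dns_sans, host):
    """Return True if any DNS SAN matches host (exact or one-label wildcard)."""
    host = host.lower().rstrip(".")
    for name in dns_sans:
        name = name.lower().rstrip(".")
        if name == host:
            return True
        # "*.example.com" covers "a.example.com" but not "a.b.example.com".
        if name.startswith("*.") and "." in host and host.split(".", 1)[1] == name[2:]:
            return True
    return False


# Hypothetical names: the cert was issued for the old (edge) host name only.
sans = ["localhost", "edge.example.com"]
print(cert_covers_host(sans, "edge.example.com"))  # old advertised name
print(cert_covers_host(sans, "link.example.com"))  # new link listener name
```

Since the SANs are fixed when the router enrolls, changing the advertised host afterwards does not change the certificate, so the router has to be re-enrolled to pick up the new name.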

Keep you posted!

So, as said in the last post, this was the problem!

With the new values.yaml containing an explicit advertised host and port, and ingress creation enabled for the router transport link, I uninstalled the public router, deleted and re-added it in the openziti controller via the CLI, and re-installed it with a new JWT. The certificates and their names are now fine, and another private router can connect to it without any problem. Just works!

Special thanks to @TheLumberjack who has supported me very much during this thread, and thanks to @plorenz and @marvkis for seeing what the problem really was.

Weekend is rescued!


Happy to hear you got everything sorted! Cheers