Adding new router to ziti network fails

Hi folks,

I am scratching my head here and hope that you can give me a push in the right direction.

I have a newly set up ziti network that currently consists only of a router and a controller. I used the helm charts to install everything, and all components are on version 0.27.5. The control plane, the client API and my router's edge API are all publicly available, using Ingress objects and proper TLS passthrough on the ingress controller. I can successfully dial into the OpenZiti network from my laptop using the desktop edge and communicate with services inside the network.

I am now failing to add a simple additional private router. I used the CLI on the controller to run ziti edge create edge-router secadm-int-router -o /tmp/my-private-router.jwt -t --no-traversal, then copied this JWT file to my laptop and installed the router in another Kubernetes cluster like this: helm install private-router -f private-router_values.yaml --set-file enrollmentJwt=my-private-router.jwt openziti/ziti-router.

My ziti-router.yml, which is created by the helm chart, looks like this:

v: 3
identity:
  cert:        ${ZITI_ROUTER_IDENTITY_DIR}/client.crt
  server_cert: ${ZITI_ROUTER_IDENTITY_DIR}/tls.crt
  key:         ${ZITI_ROUTER_IDENTITY_DIR}/tls.key
  ca:          ${ZITI_ROUTER_IDENTITY_DIR}/ca.crt

ctrl:
  endpoint:    tls:ctrlplane.sdn.my.org:443

link:
  dialers:
    - binding: transport
listeners:
  - binding: edge
    address: tls:0.0.0.0:3022
    options:
        advertise: ziti-router.kube.my.org:443
        connectTimeoutMs: 1000
        getSessionTimeout: 60
  - binding: tunnel
    options:
        mode: host
edge:
    csr:
        sans:
            dns:
                - localhost
                - ziti-router.kube.my.org
            ip:
                - 127.0.0.1
forwarder:
    latencyProbeInterval: 10
    xgressDialQueueLength: 1000
    xgressDialWorkerCount: 128
    linkDialQueueLength: 1000
    linkDialWorkerCount: 32

The router starts up and at first glance it looks like it's working. In the Ziti Console I see two green dots in front of this new router's identity.

BUT: I see this when checking the fabric links from the controller:

ziti fabric list links
╭────────────────────────┬───────────────────┬─────────────┬─────────────┬─────────────┬─────────────┬────────┬────────┬───────────╮
│ ID                     │ DIALER            │ ACCEPTOR    │ STATIC COST │ SRC LATENCY │ DST LATENCY │ STATE  │ STATUS │ FULL COST │
├────────────────────────┼───────────────────┼─────────────┼─────────────┼─────────────┼─────────────┼────────┼────────┼───────────┤
│ 2a1pwELa3rbUQ1y6zpZwsb │ private-router    │ core-router │           1 │   65000.0ms │   65000.0ms │ Failed │     up │    130001 │
╰────────────────────────┴───────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴────────┴────────┴───────────╯
results: 1-1 of 1

The private router logs this every minute:

[5048.000]    INFO fabric/router/handler_ctrl.(*dialHandler).handle |link, linkDialer|: {routerVersion=[v0.27.5] linkId=[4b5Td6QI1TujofphEhhhyC] routerId=[Ff8oRvyqtj] address=[tls:router-edge.sdn.my.org:443] linkProtocol=[tls]} dialing link
[5048.068]    INFO fabric/router/handler_link.(*bindHandler).BindChannel: {routerId=[Ff8oRvyqtj] routerVersion=[v0.27.5] linkId=[4b5Td6QI1TujofphEhhhyC]} link destination support heartbeats
[5048.068]    INFO fabric/router/handler_link.(*closeHandler).HandleClose [ch{l/4b5Td6QI1TujofphEhhhyC}->u{classic}->i{a2EE}]: {linkId=[4b5Td6QI1TujofphEhhhyC] routerId=[Ff8oRvyqtj]} link closed
[5048.130]    INFO fabric/router/handler_link.(*bindHandler).BindChannel: {linkId=[4b5Td6QI1TujofphEhhhyC] routerId=[Ff8oRvyqtj] routerVersion=[v0.27.5]} link destination support heartbeats
[5048.130]    INFO fabric/router.(*xlinkAccepter).Accept: accepted new link [l/4b5Td6QI1TujofphEhhhyC]
[5048.130]    INFO fabric/router.(*linkRegistryImpl).applyLink: {linkProtocol=[tls] newLinkId=[4b5Td6QI1TujofphEhhhyC] dest=[Ff8oRvyqtj]} link being registered, but is already closed, skipping registration
[5048.130]    INFO fabric/router/handler_link.(*closeHandler).HandleClose [ch{l/4b5Td6QI1TujofphEhhhyC}->u{classic}->i{XppP}]: {routerId=[Ff8oRvyqtj] linkId=[4b5Td6QI1TujofphEhhhyC]} link closed

And the other router, which should receive this connection, says this:

[5491.461]   ERROR edge/router/xgress_edge.(*sessionConnectionHandler).HandleClose: {id=[4b5Td6QI1TujofphEhhhyC]} session connection handler encountered a HandleClose that did not have a SessionTokenHeader
[5491.461]   ERROR channel/v2.AcceptNextChannel.func1: {error=[no token attribute provided]} failure accepting channel edge with underlay u{classic}->i{a2EE}
[5491.524]   ERROR edge/router/xgress_edge.(*sessionConnectionHandler).HandleClose: {id=[4b5Td6QI1TujofphEhhhyC]} session connection handler encountered a HandleClose that did not have a SessionTokenHeader
[5491.524]   ERROR channel/v2.AcceptNextChannel.func1: {error=[no token attribute provided]} failure accepting channel edge with underlay u{classic}->i{XppP}

I don't see what I might have done wrong, or what could be different from other setups, except that my controller and "core-router" are made public via an Ingress and ingress controller, in which I cannot yet see any error.

Any hints that these log messages might give you?

Thanks a lot in advance.

Christian

Thanks for all the details you provided. It's helpful. Can you also show the "link.listeners" section of the "public" edge router, the one that should have an advertised address? I'd like to verify the certificate presented is valid. You've probably done that, but I figured I'd check as well, and verify that the TLS passthrough is indeed proper. :slight_smile: It seems like you probably know this, but it doesn't hurt to check, even though I know your edge clients can connect to the edge listener.

From what you have described, the only possible problems seem to be that or the TLS passthrough. If it were me, I would start by doing this process over, but I would take the "private-side" Kubernetes automation out of the equation and simply start a router on your local laptop (or wherever you want) and verify that it starts up properly outside the "private" Kubernetes automation. At least then we'll know which side the problem is on: the public Kubernetes cluster side or the private cluster side.

Does that make sense and seem like a sensible test to you?

Hi @TheLumberjack!

Yes, it definitely makes sense to dig more into the ingress and certificate passthrough stuff, and yes, setting up a router on my laptop is another thing I had in mind, but for now I discarded it because I don't see how the problem could be in the orchestration of starting up the router inside k8s.

Here's the requested part of the public router's config. Domain names are changed for privacy, but (at least I hope) I do this consistently over the entire post:

link:
  dialers:
    - binding: transport
  listeners:
    - binding:          transport
      bind:             tls:0.0.0.0:10080
      advertise:        tls:router-edge.sdn.my.org:443
      options:
        outQueueSize:   4

The Ingress object:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  labels:
  name: ziti-core-router-edge
  namespace: openziti
spec:
  ingressClassName: nginx
  rules:
  - host: router-edge.sdn.my.org
    http:
      paths:
      - backend:
          service:
            name: ziti-core-router-edge
            port:
              number: 443
        path: /
        pathType: Prefix
status:
  loadBalancer:
    ingress:
    - ip: 192.168.232.30
    - ip: 192.168.232.31
    - ip: 192.168.232.32

I did two more tests. I added a debug sidecar to the internal private router, ran curl https://router-edge.sdn.my.org and checked the logs of the public router, just to make sure the request really ends up there. Yes, it does! Of course I get ugly SSL errors, but that is expected:

[10898.692]   ERROR channel/v2.(*classicListener).acceptConnection.func1 [tls:0.0.0.0:3022]: error receiving hello from [tls:192.168.232.32:42881] (receive error (local error: tls: bad record MAC))

So I did a connection test with openssl to see what certificate the public router spits out:

bash-5.1# openssl s_client -showcerts -connect  router-edge.sdn.my.org:443
CONNECTED(00000003)
depth=2 CN = ziti-controller-edge-root
verify error:num=19:self signed certificate in certificate chain
verify return:1
depth=2 CN = ziti-controller-edge-root
verify return:1
depth=1 CN = ziti-controller-edge-signer
verify return:1
depth=0 C = , ST = , L = , O = , OU = , CN = Ff8oRvyqtj
verify return:1
---
Certificate chain
 0 s:C = , ST = , L = , O = , OU = , CN = Ff8oRvyqtj
   i:CN = ziti-controller-edge-signer
-----BEGIN CERTIFICATE-----
MIIDtjCCA1ugAwIBAgIDBmFnMAoGCCqGSM49BAMCMCYxJDAiBgNVBAMTG3ppdGkt
Y29udHJvbGxlci1lZGdlLXNpZ25lcjAeFw0yMzAzMjQyMTA0NTZaFw0yNDAzMjQy
MTA1NTZaMEwxCTAHBgNVBAYTADEJMAcGA1UECBMAMQkwBwYDVQQHEwAxCTAHBgNV
BAoTADEJMAcGA1UECxMAMRMwEQYDVQQDEwpGZjhvUnZ5cXRqMIICIjANBgkqhkiG
9w0BAQEFAAOCAg8AMIICCgKCAgEAyMw4yCI7JTqvuI8pl1dm2hb+5Ve/xnyEpICe
aX4n19mllMuKHgqZwk/XqO/XPN7PpIGf8iSEnq6PnFYJexKHoFZXS5dSKo84i9xV
wbzCjp6DPct6OSouznJ5qP55nuJbOkYa6p14DxJXVYQaxZm14M68H0K5cn5Oktph
UWyRIlvR+k29jeCU1D1ySLFWwp+n4NYfAaWwY3lfoq2hy7ED8IvGU/+MBh46SO7a
Hw3GOTAo7Qo+c5y9MODSgSqJmBkky/h19auPBpAxj2tFemfzSSt0O4f0FAaZ5j3C
JyBrKxfyjEw6K2BtHrKFioHrpPaQTJdJJ1E6N7Wurg0nGYCpAOnUiyLGHy1uajtj
rJIYHRueu7+4Q+7NCrnoad804tnnglrefUEqQ2R3mNOac3HBwgPyn8oJQrp1BlYH
U7g97Qh56e6PaFCm+dGU/v+vFXvSfmUR4RfS7UOrnvXnkag+tp4J6TnES6A/5/WU
45Ojf72JgrAuzCZExPl0N0VrsKZtS7t5Dgc3ndGm/etZZEHVTcPmHKrqSWWM8iSs
BvTPKgoXYyYMhPSBLp4VaxiDGOhuYbFphQt8V3P17fUbj21SEn9U5vjd/VqRa+vt
Lxct6Y87K0XNefpuDae/isThIZ17gtB21IkiK6I1tz2qOvR0nYGpyS9edqrLGCOU
6Rh5V7UCAwEAAaOBhjCBgzAOBgNVHQ8BAf8EBAMCBLAwEwYDVR0lBAwwCgYIKwYB
BQUHAwEwHwYDVR0jBBgwFoAUwLun050imKSFsDleQoI6NEEZ5aAwOwYDVR0RBDQw
MoIJbG9jYWxob3N0gh9yb3V0ZXItZWRnZS5zZG4uZW50aHVzLnNlcnZpY2VzhwR/
AAABMAoGCCqGSM49BAMCA0kAMEYCIQDLhgCmyVqwTWchy2nWqrrzc+HEHhcHq5+P
v4XMVONfRQIhAJWonBBUj3iRT2POiIp0e70tueUxgo0qBU5DJP/iEBZZ
-----END CERTIFICATE-----
 1 s:CN = ziti-controller-edge-signer
   i:CN = ziti-controller-edge-root
-----BEGIN CERTIFICATE-----
MIIBrDCCAVKgAwIBAgIRAPXB5W8T7DEfZtrew4yoXTYwCgYIKoZIzj0EAwIwJDEi
MCAGA1UEAxMZeml0aS1jb250cm9sbGVyLWVkZ2Utcm9vdDAeFw0yMzAzMjQxODM0
MjVaFw0zMzAzMzExODM0MjVaMCYxJDAiBgNVBAMTG3ppdGktY29udHJvbGxlci1l
ZGdlLXNpZ25lcjBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABIpoWXqDQJXsuKy9
4hZ8bAd+K0+AvbV8HStELfJn27j9uxvPTK7Cx2tujdzvfRug5T0i7ueCClnLAJRv
OO6ML0ujYzBhMA4GA1UdDwEB/wQEAwIBhjAPBgNVHRMBAf8EBTADAQH/MB0GA1Ud
DgQWBBTAu6fTnSKYpIWwOV5Cgjo0QRnloDAfBgNVHSMEGDAWgBRBfOeY41RmOiaV
1n9K3e1GAjU/nTAKBggqhkjOPQQDAgNIADBFAiAo11qsjm7VhERX8tWwEjWRL0Cj
OBq3QK0azLNUDoJuXgIhAM7G76536F2UpdwD5QmnGnH7KMTvyHWLNMO9kEv279w2
-----END CERTIFICATE-----
 2 s:CN = ziti-controller-edge-root
   i:CN = ziti-controller-edge-root
-----BEGIN CERTIFICATE-----
MIIBijCCAS+gAwIBAgIRAPsjaayJllKSo6vmc9QVQWUwCgYIKoZIzj0EAwIwJDEi
MCAGA1UEAxMZeml0aS1jb250cm9sbGVyLWVkZ2Utcm9vdDAeFw0yMzAzMjQxODM0
MjNaFw0zMzAzMzExODM0MjNaMCQxIjAgBgNVBAMTGXppdGktY29udHJvbGxlci1l
ZGdlLXJvb3QwWTATBgcqhkjOPQIBBggqhkjOPQMBBwNCAATF6QrP8KknGyLNRhd1
u5/X0gJ/nAov2AcfcjWkTntM8XA50Jrx1MCGb2u35c9fzOWKru0lZQLYJFHY6UUs
5z2fo0IwQDAOBgNVHQ8BAf8EBAMCAYYwDwYDVR0TAQH/BAUwAwEB/zAdBgNVHQ4E
FgQUQXznmONUZjomldZ/St3tRgI1P50wCgYIKoZIzj0EAwIDSQAwRgIhANXKKmw1
gbMXSaS3J3jkg8rrMCx9TYFJXqngX8lKyjXgAiEA5VriauLLSXi8PpJGfRln0bdk
Kz7FFT0QTOFTLaPaYcA=
-----END CERTIFICATE-----
---
Server certificate
subject=C = , ST = , L = , O = , OU = , CN = Ff8oRvyqtj

issuer=CN = ziti-controller-edge-signer

---
No client certificate CA names sent
Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1
Shared Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 2657 bytes and written 427 bytes
Verification error: self signed certificate in certificate chain
---
New, TLSv1.3, Cipher is TLS_AES_128_GCM_SHA256
Server public key is 4096 bit
Secure Renegotiation IS NOT supported
No ALPN negotiated
Early data was not sent
Verify return code: 19 (self signed certificate in certificate chain)
---
140539129076552:error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate:ssl/record/rec_layer_s3.c:1543:SSL alert number 42

Doesn't look all bad, huh?

Also please note: I am successfully connected to the network with a Ziti Desktop Edge and I am actually able to consume services. …

The openssl command is what I wanted to run. If you want to DM me the actual name of the public router you can.

I wanted to see the X509v3 Subject Alternative Name from the final leaf cert and verify that it's "router-edge.sdn.my.org". I usually use this command for that:

openssl s_client -connect  router.clint.demo.openziti.org:8442 | openssl x509 -text | grep Alter -a1

You'll see it returns:

            X509v3 Subject Alternative Name:
                DNS:*.clint.demo.openziti.org
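
The SAN check above can be turned into a pass/fail test with a small shell helper. This is only a sketch: the function name san_has_dns is made up for illustration, it takes the SAN line as text (for example, the line printed below "X509v3 Subject Alternative Name:"), and it handles exact matches plus a leading single-label wildcard only. It is not a full hostname verification.

```shell
# san_has_dns SAN_LINE HOSTNAME   (hypothetical helper, not part of ziti or openssl)
# Returns 0 if HOSTNAME is covered by a DNS entry in SAN_LINE, 1 otherwise.
# The body runs in a subshell, so IFS and "set -f" do not leak to the caller.
san_has_dns() (
  host=$2
  IFS=','   # SAN entries are comma separated
  set -f    # wildcard entries contain '*', so disable globbing
  for entry in $1; do
    # Trim leading and trailing spaces from the entry.
    entry=${entry#"${entry%%[! ]*}"}
    entry=${entry%"${entry##*[! ]}"}
    case $entry in
      DNS:*) name=${entry#DNS:} ;;
      *) continue ;;   # skip "IP Address:" and other entry types
    esac
    # Exact match.
    if [ "$name" = "$host" ]; then
      exit 0
    fi
    # Wildcard match: "*.example.org" covers exactly one extra label.
    case $name in
      '*.'*)
        suffix=${name#\*}
        label=${host%"$suffix"}
        if [ "$label" != "$host" ] && [ -n "$label" ]; then
          case $label in
            *.*) ;;        # more than one label, not covered
            *) exit 0 ;;
          esac
        fi
        ;;
    esac
  done
  exit 1
)
```

For the wildcard certificate shown above, san_has_dns "DNS:*.clint.demo.openziti.org" router.clint.demo.openziti.org succeeds, while any name outside that domain fails.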

I expect that the first replacement you did ziti-router.kube.my.org is not actually different than router-edge.sdn.my.org, right?

I don't see how the problem could be in the orchestration of starting up the router inside k8s

I think it's worthwhile to still go through with connecting a manually provisioned router. I know you can connect your edge client, but the PKI that gets used for fabric links is not necessarily the same as the PKI used for edge connections, so it is possibly different.

...
...
.... so I wrote all that, then looked at your results one more time, and I think maybe it DOES look wrong?
...
...

I looked at openssl s_client -showcerts -connect router-edge.sdn.my.org:443 one more time and I noticed that the chain returned sure looks like it is for the CONTROLLER and not the router, right?

That openssl s_client should be returning a chain that is from the router, I think? The SANs would show us that for sure though, so can you run that openssl command and get the X509v3 Subject Alternative Name section?

Hi and thanks for digging deep with me :slight_smile:

This is a nice way to see whether the right component answers, also, yes!

openssl s_client -connect   router-edge.sdn.my.org:443 | openssl x509 -text | grep Alter -a1

gives me:

            X509v3 Subject Alternative Name:
                DNS:localhost, DNS:router-edge.sdn.my.org, IP Address:127.0.0.1

Again, domain name changed, but it's answering with the right one, I can confirm!

It actually is a different name, only accessible from internal network. This is the config of the private router, which I would like to connect edges to later on (maybe) but it should not be connected from other routers to form the router fabric, as it is not possible to connect from the public router to this private one.

OK, I will very probably try this tomorrow (today. :))...

This is strange. As the openssl above showed the right name. ... but yes, it says "controller" there a lot of times. .... Maybe we DO have a problem with the helm chart here, but not with my private but with the public router. ... but still: why can I consume services using my desktop edge then. ...?

Yeah, everything seems to check out properly. It definitely makes me think that the "public" cluster is indeed set up correctly. The steps you outline are exactly what I would do. The one small thing I personally don't do is use --no-traversal, so it's possible that's having an effect that isn't obvious to me. I'd be interested in your "non-kubernetes" test and, if that has the same results, in removing the --no-traversal flag just to test whether that's somehow causing an issue.

Keeping it out of Kubernetes just reduces some possible areas of complexity… For example, when the router enrolls, it needs to write the PKI to the locations specified in the topmost identity section of the config. If one or more of those locations don't persist properly, it could cause problems. Doing it all locally would just eliminate a few of those types of variables.
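
One way to rule out the persistence problem described above is to check that all four files named in the identity section exist and are non-empty. The sketch below assumes the file names from the router config posted earlier (client.crt, tls.crt, tls.key, ca.crt); check_identity_dir is a made-up helper name, not a ziti command.

```shell
# check_identity_dir DIR   (hypothetical helper)
# Prints the state of each identity file and returns non-zero if any of
# them is missing or empty. File names are taken from the "identity"
# section of the ziti-router.yml shown earlier in this thread.
check_identity_dir() (
  dir=$1
  status=0
  for f in client.crt tls.crt tls.key ca.crt; do
    if [ -s "$dir/$f" ]; then
      echo "ok       $dir/$f"
    else
      echo "MISSING  $dir/$f"
      status=1
    fi
  done
  exit "$status"
)
```

Inside the router pod (for example from a debug sidecar that mounts the same volume) this could be called as check_identity_dir "$ZITI_ROUTER_IDENTITY_DIR". It only proves the files are present, not that their contents are correct.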

Once you connect an edge router without going through Kubernetes automation, that'll give us more information to go by.

What you have done seems like it should be OK. That's why I'm asking the "dumb" type questions, since it seems to me like you did it right…

Thanks, let me know how the "local" router install goes.

Good morning!

I have done a quick setup of a router on my notebook. I have used version 0.27.5 as this is the version I am using in Kubernetes (pull request to upgrade to 0.27.7 is still open), and then I tried with 0.27.7. Result is the same on both.

Long story short: same problem:

./ziti-router run router.yaml
[   0.022]    INFO ziti/ziti/router.run: {revision=[3d9801e73809] go-version=[go1.19.5] os=[darwin] configFile=[router.yaml] build-date=[2023-02-13T21:41:19Z] arch=[amd64] routerId=[55Ls.Wy4PX] version=[v0.27.5]} starting ziti-router
[   0.025]    INFO fabric/router/forwarder.(*Scanner).run: started
[   0.025]    INFO fabric/router/forwarder.(*Faulter).run: started
[   0.027]    INFO fabric/metrics.GoroutinesPoolMetricsConfigF.func1.1: {maxQueueSize=[1000] minWorkers=[0] maxWorkers=[32] idleTime=[30s] poolType=[pool.link.dialer]} starting goroutine pool
[   0.027]    INFO fabric/metrics.GoroutinesPoolMetricsConfigF.func1.1: {idleTime=[30s] maxQueueSize=[1000] minWorkers=[0] maxWorkers=[128] poolType=[pool.route.handler]} starting goroutine pool
[   0.028] WARNING edge/router/internal/edgerouter.(*Config).LoadConfigFromMap: Invalid heartbeat interval [0] (min: 60, max: 10), setting to default [60]
[   0.030] WARNING edge/router/internal/edgerouter.parseEdgeListenerOptions: port in [listeners[0].options.advertise] must equal port in [listeners[0].address] for edge binding but did not. Got [443] [3022]
[   0.033]    INFO fabric/router.(*Router).initializeCtrlEndpoints: controller endpoints file [endpoints] doesn't exist. Using initial endpoints from config
[   0.034]    INFO fabric/router.(*Router).showOptions: ctrl = {"OutQueueSize":4,"MaxQueuedConnects":1,"MaxOutstandingConnects":16,"ConnectTimeout":1000000000,"DelayRxStart":false,"WriteTimeout":0}
[   0.034]    INFO fabric/router.(*Router).showOptions: metrics = {"ReportInterval":60000000000,"MessageQueueSize":10}
[   0.034]    INFO fabric/router.(*Router).initializeHealthChecks: starting health check with ctrl ping initially after 15s, then every 30s, timing out after 15s
[   0.035]    INFO fabric/router.(*Router).startXlinkDialers: started Xlink dialer with binding [transport]
[   0.037]    INFO edge/router/xgress_edge.(*listener).Listen: {address=[tls:0.0.0.0:3022]} starting channel listener
[   0.037]    INFO fabric/metrics.GoroutinesPoolMetricsConfigF.func1.1: {poolType=[pool.listener.xgress_edge] idleTime=[10s] maxQueueSize=[1] minWorkers=[1] maxWorkers=[16]} starting goroutine pool
[   0.039]    INFO fabric/router.(*Router).startXgressListeners: created xgress listener [edge] at [tls:0.0.0.0:3022]
[   0.039]    INFO edge/router/xgress_edge.(*Acceptor).Run: starting
[   0.039]    INFO fabric/router.(*Router).startXgressListeners: created xgress listener [tunnel] at []
[   0.040]    INFO fabric/router.(*Router).startControlPlane: router configured with 1 controller endpoints
[   0.040]    INFO fabric/router.(*Router).startControlPlane: connecting to controller at endpoing [tls:ctrlplane.sdn.my.org:443]
[   0.189]    INFO edge/router/fabric.(*StateManagerImpl).StartHeartbeat: heartbeat starting
[   0.190]    INFO edge/router/xgress_edge_tunnel.(*tunneler).Start: {mode=[host]} creating interceptor
[   0.190]    INFO edge/router/xgress_edge.(*CertExpirationChecker).Run: waiting 8615h59m16.577852s to renew certificates
[   0.195]    INFO edge/router/handler_edge_ctrl.(*helloHandler).HandleReceive.func1: received server hello, replying
[   0.198] WARNING edge/tunnel/dns.flushDnsCaches: {error=[exec: "resolvectl": executable file not found in $PATH]} unable to find systemd-resolve or resolvectl in path, consider adding a dns flush to your restart process
[   0.229]    INFO edge/router/handler_edge_ctrl.(*apiSessionAddedHandler).instantSync: {strategy=[instant]} first api session syncId [clfz6n2qjfu0e018q2c9pfinf], starting
[   0.230]    INFO edge/router/handler_edge_ctrl.(*apiSessionSyncTracker).Add: received api session sync chunk 0, isLast=true
[   0.586]    INFO fabric/router/handler_ctrl.(*dialHandler).handle |link, linkDialer|: {routerId=[Ff8oRvyqtj] address=[tls:router-edge.sdn.my.org:443] linkProtocol=[tls] routerVersion=[v0.27.5] linkId=[7LnlaVm2x8RcecYJumdNWx]} dialing link
[   0.838]    INFO fabric/router/handler_link.(*bindHandler).BindChannel: {routerVersion=[v0.27.5] linkId=[7LnlaVm2x8RcecYJumdNWx] routerId=[Ff8oRvyqtj]} link destination support heartbeats
[   0.838]    INFO fabric/router/handler_link.(*closeHandler).HandleClose [ch{l/7LnlaVm2x8RcecYJumdNWx}->u{classic}->i{rNmM}]: {routerId=[Ff8oRvyqtj] linkId=[7LnlaVm2x8RcecYJumdNWx]} link closed
[   1.223]    INFO fabric/router/handler_link.(*bindHandler).BindChannel: {linkId=[7LnlaVm2x8RcecYJumdNWx] routerId=[Ff8oRvyqtj] routerVersion=[v0.27.5]} link destination support heartbeats
[   1.224]    INFO fabric/router.(*xlinkAccepter).Accept: accepted new link [l/7LnlaVm2x8RcecYJumdNWx]
[   1.224]    INFO fabric/router.(*linkRegistryImpl).applyLink: {linkProtocol=[tls] newLinkId=[7LnlaVm2x8RcecYJumdNWx] dest=[Ff8oRvyqtj]} link being registered, but is already closed, skipping registration
[   1.224]    INFO fabric/router/handler_link.(*closeHandler).HandleClose [ch{l/7LnlaVm2x8RcecYJumdNWx}->u{classic}->i{XzKQ}]: {linkId=[7LnlaVm2x8RcecYJumdNWx] routerId=[Ff8oRvyqtj]} link closed
[   1.231]    INFO edge/router/handler_edge_ctrl.(*apiSessionAddedHandler).applySync: finished sychronizing api sessions [count: 5, syncId: clfz6n2qjfu0e018q2c9pfinf, duration: 201.709µs]
[   2.028]    INFO edge/tunnel/intercept.SetDnsInterceptIpRange: dns intercept IP range: 100.64.0.1 - 100.127.255.254


[  60.865]    INFO fabric/router/handler_ctrl.(*dialHandler).handle |link, linkDialer|: {address=[tls:router-edge.sdn.my.org:443] linkProtocol=[tls] routerVersion=[v0.27.5] linkId=[70VYEyDxMla3SExcN8kBCF] routerId=[Ff8oRvyqtj]} dialing link
[  60.981]    INFO fabric/router/handler_link.(*bindHandler).BindChannel: {linkId=[70VYEyDxMla3SExcN8kBCF] routerId=[Ff8oRvyqtj] routerVersion=[v0.27.5]} link destination support heartbeats
[  60.982]    INFO fabric/router/handler_link.(*closeHandler).HandleClose [ch{l/70VYEyDxMla3SExcN8kBCF}->u{classic}->i{dnR4}]: {routerId=[Ff8oRvyqtj] linkId=[70VYEyDxMla3SExcN8kBCF]} link closed
[  61.093]    INFO fabric/router/handler_link.(*bindHandler).BindChannel: {linkId=[70VYEyDxMla3SExcN8kBCF] routerId=[Ff8oRvyqtj] routerVersion=[v0.27.5]} link destination support heartbeats
[  61.093]    INFO fabric/router.(*xlinkAccepter).Accept: accepted new link [l/70VYEyDxMla3SExcN8kBCF]
[  61.093]    INFO fabric/router.(*linkRegistryImpl).applyLink: {linkProtocol=[tls] newLinkId=[70VYEyDxMla3SExcN8kBCF] dest=[Ff8oRvyqtj]} link being registered, but is already closed, skipping registration
[  61.094]    INFO fabric/router/handler_link.(*closeHandler).HandleClose [ch{l/70VYEyDxMla

There MUST be something wrong with the certificate chain on the public router, something wrong with the way the helm chart builds that… I still wonder why another ROUTER can't connect to my public router but a desktop edge client can. … Shouldn't the same mechanism be behind both?

I did another test with my still-running ziti setup, where I have published all necessary ports to the outside without an ingress controller: I use "connect.my.org" with different ports and port forwarding there; 1290 would be the edge-router port.

I see two differences. One is that the SANs include the term "core-router", but this could be because of the different way the helm charts do the setup (we set up this installation together with @marvkis using his helm charts, which have been merged into the original ones but with changes). Worth mentioning anyway:

~/tmp/ziti-router-test/ openssl s_client -connect connect.my.org:1290 | openssl x509 -text | grep Alter -a1
depth=2 CN = ziti-signing-root-ca
verify error:num=19:self signed certificate in certificate chain
verify return:0

            X509v3 Subject Alternative Name:
                DNS:localhost, DNS:connect.my.org, DNS:core-router, IP Address:127.0.0.1
8268038464:error:1404C412:SSL routines:ST_OK:sslv3 alert bad certificate:/AppleInternal/Library/BuildRoots/9e200cfa-7d96-11ed-886f-a23c4f261b56/Library/Caches/com.apple.xbs/Sources/libressl/libressl-3.3/ssl/tls13_lib.c:129:SSL alert number 42

…and if I take the latter part out of that command to see the certificate chain, I clearly see that it looks different from the one the new public router presents:

~/tmp/ziti-router-test/ openssl s_client -connect connect.my.org:1290
CONNECTED(00000005)
depth=2 CN = ziti-signing-root-ca
verify error:num=19:self signed certificate in certificate chain
verify return:0
write W BLOCK
---
Certificate chain
 0 s:/C=/ST=/L=/O=/OU=/CN=0Sa9Z5iuj
   i:/CN=ziti-signing-intermediate-ca
 1 s:/CN=ziti-signing-intermediate-ca
   i:/CN=ziti-signing-root-ca
 2 s:/CN=ziti-signing-root-ca
   i:/CN=ziti-signing-root-ca
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIDujCCA2CgAwIBAgIDAnEFMAoGCCqGSM49BAMCMCcxJTAjBgNVBAMTHHppdGkt
c2lnbmluZy1pbnRlcm1lZGlhdGUtY2EwHhcNMjMwMTI1MTgyNTU2WhcNMjQwMTI1
MTgyNjU2WjBLMQkwBwYDVQQGEwAxCTAHBgNVBAgTADEJMAcGA1UEBxMAMQkwBwYD
VQQKEwAxCTAHBgNVBAsTADESMBAGA1UEAxMJMFNhOVo1aXVqMIICIjANBgkqhkiG
9w0BAQEFAAOCAg8AMIICCgKCAgEA6RvoFBGmrqflQIML6PGhRsNv/hfjyvJeoJYy
F/5n6oDJiPmsZOKYwSnPQpXMKgL1hxZC7AiN4+e4DyALB49Oid1CwfEcrBpkSzhc
srcZMW2/QzWhPiMHpZU2A67lCGp3mbMxRt1wwgJVxxI+52muaCGwnueSUMZd3U9M
melVs4WER5aehZfN3ZPKVlNeCxweS+nom0oBEG+Hk6f6t1FySFOoZv5vYYamn7TX
CT4RCkFrfpRuKS+7shUTmHkwk0s2TcTtSFIkURnbORWndPzyDvwvJhtEj0cqIHUc
icRX8tdW4Eq7UHisCpmoTsvRLfNEHm5zlE4Ph3WrBYG6+rv5NbyB+hkJTAbjgIW7
6Qq3Oe06C0E+h1N+8iVQ1wcN3Uh+BaQ7o8H1j8zCyk1hqXipNsCsMJmIpROpJeTY
4ZdDSWO+rdMqXK9ECJlBwdx48iPCY/5jfxxZEfmD0gopv7e7P57rB5NiF3NNI4pn
gD+awj09rhIJMTvD/0Is5nDAv/7y0Sm814/TiDtOhyrpcCYiF/xIh1mk21j1w3OY
v17FCUyVTPXQDDUBFJuepUn1qkXyVUy7EGM+7ceUNWh0qLAUa52mthdOWR8VadsL
pJMRsf25Q9HE1TZp4O6PCGa0MQgr9oKHiCwtjw9Ueqf8NeGjmEbQNOPiL57orqwm
2d4ItikCAwEAAaOBizCBiDAOBgNVHQ8BAf8EBAMCBLAwEwYDVR0lBAwwCgYIKwYB
BQUHAwEwHwYDVR0jBBgwFoAU7qMQxAZ6Qpj5EdgUsNtwz+tqqg4wQAYDVR0RBDkw
N4IJbG9jYWxob3N0ghdjb25uZWN0LmVudGh1cy5zZXJ2aWNlc4ILY29yZS1yb3V0
ZXKHBH8AAAEwCgYIKoZIzj0EAwIDSAAwRQIhAM7fSwFeZ+1MDCGngVRWvNkQuJON
YstsTwvKgPwGS74WAiAyRruaXdX+odmJuwjf/08F9Cq68NuHnOP37EqrwlEdlw==
-----END CERTIFICATE-----
subject=/C=/ST=/L=/O=/OU=/CN=0Sa9Z5iuj
issuer=/CN=ziti-signing-intermediate-ca
---
No client certificate CA names sent
Server Temp Key: ECDH, X25519, 253 bits
---
SSL handshake has read 2647 bytes and written 381 bytes
---
New, TLSv1/SSLv3, Cipher is AEAD-CHACHA20-POLY1305-SHA256
Server public key is 4096 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.3
    Cipher    : AEAD-CHACHA20-POLY1305-SHA256
    Session-ID:
    Session-ID-ctx:
    Master-Key:
    Start Time: 1680427398
    Timeout   : 7200 (sec)
    Verify return code: 19 (self signed certificate in certificate chain)
---

I will have a deeper look later.

I have dug a bit into the public core router setup. I noticed that 8 days ago the "post-install-job" apparently failed. Unfortunately, I can't see what exactly went wrong, because the pod created from this job is already gone. The post-install job runs the following script:

#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
set -o xtrace

if kubectl -n openziti get secret \
  ziti-core-router-identity &>/dev/null; then
  if [[ ${HELM_UPGRADE:-} == true ]]; then
    echo 'INFO: no-op because secret exists and is Helm upgrade'
    # exit without error so Helm will delete the post-upgrade hook Job
    exit 0
  else
    echo 'ERROR: secret exists: "ziti-core-router-identity"' >&2
    # this should never happen because Helm deletes the secret with pre-uninstall hook
    exit 1
  fi
else
  echo "INFO: identity secret does not exist, attempting router enrollment"
fi

mkdir -v ${ZITI_ROUTER_IDENTITY_DIR}

ziti router enroll \
  /etc/ziti/config/ziti-router.yaml \
  --jwt /etc/ziti/config/enrollment.jwt \
  --verbose

kubectl -n openziti create secret generic \
  ziti-core-router-identity \
  --from-file=client.crt="${ZITI_ROUTER_IDENTITY_DIR}/client.crt" \
  --from-file=tls.key="${ZITI_ROUTER_IDENTITY_DIR}/tls.key" \
  --from-file=tls.crt="${ZITI_ROUTER_IDENTITY_DIR}/tls.crt" \
  --from-file=ca.crt="${ZITI_ROUTER_IDENTITY_DIR}/ca.crt"

So it MAY be that something went wrong with the initial installation of the public router, causing it to deliver the wrong certificate structure somehow?

BUT the router works and communicates fine with the controller and, as said, edge tunnelers also work fine.

The certificates all exist. But I am not 100% sure which of them is used for what, exactly. I could dig deeper into these with openssl, but I lack the knowledge of what to expect.

The failed job could, on the other hand, have been caused by some other circumstance. The fact that I no longer see the corresponding pod and its logs makes that a dead end for debugging.

It's a bit complicated. They aren't exactly the same mechanism. Routers connecting to one another are related to the overlay network itself. The clients connecting is "edge" functionality and while each of them is the same insofar as mTLS, they are different PKI chains.

The last step of the script is the one that has me wondering. When a router enrolls, those four files are written, starting with the private key (tls.key). The CA bundle is fetched from the controller (ca.crt), and two certificates are then created: one for client authentication (client.crt) and one for server authentication (tls.crt).

It's definitely difficult to troubleshoot this. I'm wondering if the router pod stopped/was deleted perhaps? Maybe there's a bug around that side of things. I'll have a chat with @qrkourier tomorrow as he's a bit closer to kubernetes than I am and I think I'll have to talk to someone about the edge router pki too so I don't think we'll get too far today.

Your link listener is advertising the same port as your edge listener. So the link dial is reaching your edge listener instead of the link listener.

Cheers
Paul
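
The collision Paul describes can be checked mechanically by comparing the address the edge listener advertises with the address the link listener advertises. The sketch below is illustrative only: the function names are made up, the edge advertise value of router-edge.sdn.my.org:443 is inferred from the thread rather than shown in the public router's config, and IPv6 literals are not handled.

```shell
# normalize_endpoint ADDRESS   (hypothetical helper)
# Strips an optional transport prefix such as "tls:" so that
# "tls:host:443" and "host:443" compare equal.
normalize_endpoint() {
  case $1 in
    *:*:*) printf '%s\n' "${1#*:}" ;;   # prefix:host:port -> host:port
    *)     printf '%s\n' "$1" ;;        # already host:port
  esac
}

# endpoints_collide EDGE_ADVERTISE LINK_ADVERTISE   (hypothetical helper)
# Returns 0 when both listeners advertise the same host and port, which
# means link dials would land on the edge listener.
endpoints_collide() {
  [ "$(normalize_endpoint "$1")" = "$(normalize_endpoint "$2")" ]
}

# Values as they appear (or are inferred) in this thread:
if endpoints_collide "router-edge.sdn.my.org:443" "tls:router-edge.sdn.my.org:443"; then
  echo "link and edge listeners advertise the same endpoint"
fi
```

With a separate advertised host for the link listener, the same call returns non-zero and nothing is printed.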

Woah, eagle eyes! Yeah, that'd definitely explain why edge clients are working while links are failing. Nice catch, Paul.

Thanks @plorenz and @TheLumberjack. I believe that is exactly where the mistake is. Today I had @marvkis, with whom I did the initial setup, look over this thread, and he said basically the same thing.

The culprit was this part of the public router's config:

link:
  dialers:
    - binding: transport
  listeners:
    - binding:          transport
      bind:             tls:0.0.0.0:10080
      advertise:        tls:router-edge.sdn.my.org:443
      options:
        outQueueSize:   4

which was caused by the fact that we simply did not take care of the linkListeners.transport part of the values.yaml, which then seems to fall back to using the "global" advertisedHost for both edge and transport. Kinda makes sense, but not here :slight_smile:

With this in the values controlling the helm chart, another nice Ingress is created and my test router tries to connect to the right URL:

linkListeners:
  transport:
    advertisedHost: router-transport.sdn.my.org
    advertisedPort: 443
    ingress:
      enabled: true
      ingressClassName: nginx
      annotations:
        kubernetes.io/ingress.allow-http: "false"
        nginx.ingress.kubernetes.io/ssl-passthrough: "true"
        nginx.ingress.kubernetes.io/secure-backends: "true"

Unfortunately, the private router still doesn't connect, but this time for a good reason:

[ 482.038]   ERROR fabric/router/handler_ctrl.(*dialHandler).handle |link, linkDialer|: {error=[error dialing outgoing link [l/3MbRdV8odRie7RMdIYLbec]: error dialing payload channel for [l/3MbRdV8odRie7RMdIYLbec]: tls: failed to verify certificate: x509: certificate is valid for localhost, router-edge.sdn.my.org, not router-transport.sdn.my.org] linkId=[3MbRdV8odRie7RMdIYLbec] routerId=[Ff8oRvyqtj] address=[tls:router-transport.sdn.my.org:443] linkProtocol=[tls] routerVersion=[v0.27.5]} link dialing failed

Sure, the certificate that was auto-generated with the wrong host name when the public router was created still contains the wrong host name. So my next step is to redeploy the public router and then see what happens.

I'll keep you posted!

So, as said in the last post, this was the problem!

With the new values.yaml containing an explicit advertised host and port, and with ingress creation enabled for the router transport link, I uninstalled the public router, deleted and re-added it in the OpenZiti controller via the CLI, and re-installed it with a new JWT. The certificates and their names are now fine, and another private router can connect to it without any problem. Just works!
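
For anyone following along, the redeploy sequence described above might look roughly like the sketch below. The release name, router name, JWT path and values file name are assumptions for illustration, not the values actually used here, and the function only prints the commands so the order can be reviewed before anything is run. The two ziti commands are meant to be run where the ziti CLI is logged in to the controller, with the JWT then copied to wherever helm runs.

```shell
# All of these names are hypothetical; adjust them to your own setup.
RELEASE=${RELEASE:-ziti-core-router}
ROUTER_NAME=${ROUTER_NAME:-core-router}
NAMESPACE=${NAMESPACE:-openziti}
JWT_FILE=${JWT_FILE:-/tmp/core-router.jwt}
VALUES_FILE=${VALUES_FILE:-core-router_values.yaml}

# redeploy_plan prints the four steps in order instead of executing them:
# 1. uninstall the router release
# 2. delete the router in the controller
# 3. re-create the router to obtain a fresh enrollment JWT
# 4. install the release again with the new JWT and the corrected values
redeploy_plan() {
  cat <<EOF
helm uninstall $RELEASE --namespace $NAMESPACE
ziti edge delete edge-router $ROUTER_NAME
ziti edge create edge-router $ROUTER_NAME -o $JWT_FILE
helm install $RELEASE openziti/ziti-router --namespace $NAMESPACE -f $VALUES_FILE --set-file enrollmentJwt=$JWT_FILE
EOF
}
```

Calling redeploy_plan prints the plan. Because the router is enrolled again in step 4, its server certificate is issued with the SANs derived from the corrected values.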

Special thanks to @TheLumberjack who has supported me very much during this thread, and thanks to @plorenz and @marvkis for seeing what the problem really was.

Weekend is rescued!

Happy to hear you got everything sorted! Cheers