TLS Handshake Failed on Ziti Router Pods

Issue:

I am seeing repeated "handshake failed" errors in the logs of my Ziti router pods. The routers appear to be running fine and all of the services I created are reachable, but the errors persist.

Environment:

  • Platform: AWS EKS
  • Helm Chart: Ziti Router
  • StorageClass: EBS
  • Network: Internal routers with ClusterIP, public edge access via LoadBalancer
  • Ingress: Disabled
  • Persistent Storage: Enabled

Error Logs:

{"_context":"tls:0.0.0.0:3022","error":"EOF","file":"github.com/openziti/transport/v2@v2.0.153/tls/listener.go:257","func":"github.com/openziti/transport/v2/tls.(*sharedListener).processConn","level":"error","msg":"handshake failed","remote":"172.25.181.109:17981","time":"2025-02-07T22:54:39.274Z"}
{"_context":"tls:0.0.0.0:3022","error":"EOF","file":"github.com/openziti/transport/v2@v2.0.153/tls/listener.go:257","func":"github.com/openziti/transport/v2/tls.(*sharedListener).processConn","level":"error","msg":"handshake failed","remote":"172.25.89.248:49070","time":"2025-02-07T22:54:39.375Z"}
{"_context":"tls:0.0.0.0:3022","error":"EOF","file":"github.com/openziti/transport/v2@v2.0.153/tls/listener.go:257","func":"github.com/openziti/transport/v2/tls.(*sharedListener).processConn","level":"error","msg":"handshake failed","remote":"172.25.158.72:40193","time":"2025-02-07T22:54:39.742Z"}
{"_context":"tls:0.0.0.0:3022","error":"EOF","file":"github.com/openziti/transport/v2@v2.0.153/tls/listener.go:257","func":"github.com/openziti/transport/v2/tls.(*sharedListener).processConn","level":"error","msg":"handshake failed","remote":"172.25.107.42:60677","time":"2025-02-07T22:54:41.017Z"}

Router Values.yaml Configuration:

ctrl:
  endpoint: ziti-controller.example.com:443
  advertisedHost: ziti-router-public-1.example.com

edge:
  advertisedHost: ziti-router-public-1.example.com
  advertisedPort: 443
  service:
    type: LoadBalancer  
    annotations:
      external-dns.alpha.kubernetes.io/hostname: ziti-router-public-1.example.com
      service.beta.kubernetes.io/aws-load-balancer-internal: "false"
      service.beta.kubernetes.io/aws-load-balancer-security-groups: "sg-0158c2bd5d277c65b"
      service.beta.kubernetes.io/aws-load-balancer-manage-backend-security-group-rules: "true"
  ingress:
    enabled: false

linkListeners:
  transport:
    advertisedHost: ziti-router-public-release-1-transport.ziti-router.svc.cluster.local
    advertisedPort: 443
    service:
      enabled: true
      type: ClusterIP
    ingress:
      enabled: false 

image:
  additionalArgs:
    - '--extend'

persistence:
  enabled: true
  accessMode: ReadWriteOnce
  size: 1Gi
  storageClass: ebs-sc

What I've Checked So Far:

✅ All created services are reachable.
✅ The routers are running without crashes.
✅ The certificates should be valid; they were generated without errors.

Questions:

  1. What could be causing these TLS handshake failures?
  2. Are these errors expected behavior, or do they indicate a misconfiguration?
  3. Could this be due to mismatched certificates or an issue with the advertised hosts?
  4. Any debugging tips for identifying which service is attempting the failed handshake?

Would appreciate any insights from the community! Thanks in advance!

In my experience this is always because I've recreated my network or removed an identity, leaving behind an identity that can no longer connect because of actions I took. When that happens, the orphaned identity keeps retrying the connection periodically. You would need to use tcpmon/Wireshark to find where the attempts are coming from, which gets hard if you have multiple devices behind the same IP.
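If you want to watch the failed attempts in flight on Kubernetes, a packet capture on the shared listener port is a reasonable first step. The sketch below is only one way to do it: the namespace, pod name, and target container name are placeholders for your release, and it assumes you're allowed to attach an ephemeral debug container (the nicolaka/netshoot image ships tcpdump).

# attach a throwaway debug container that shares the router pod's network namespace,
# then capture inbound connections on the shared TLS listener port (3022 per the log)
kubectl debug -n ziti-router <router-pod> -it --image=nicolaka/netshoot --target=ziti-router -- \
  tcpdump -nn -i any 'tcp port 3022'

The source addresses on the incoming connections tell you who is dialing; you can then match them against pods, nodes, or external clients.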

Hopefully that helps. So yes, this is expected.

Here's a GitHub issue where I've raised concerns about the log level at which controllers and routers emit these messages.

In short, these client handshake errors are not necessarily errors, because they're emitted under normal circumstances: a ziti edge login with a password, Kubernetes liveness and readiness probes against the health-check endpoints, and so on.
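As a sanity check, it's easy to reproduce a benign instance of this log line: any client that opens a TCP connection to the TLS port and closes it without completing a handshake should trigger the same "handshake failed" / "EOF" entry. A minimal sketch, reusing the in-cluster transport service name and port from the values above (adjust them if your chart renders the service differently):

# open a TCP connection to the router's TLS port and close it without sending a ClientHello;
# the router should log one "handshake failed ... EOF" line for this connection
kubectl run tls-eof-check --rm -it --restart=Never --image=nicolaka/netshoot -- \
  bash -c ': </dev/tcp/ziti-router-public-release-1-transport.ziti-router.svc.cluster.local/443'

If that produces the same log entry, you've confirmed the message can come from harmless connects rather than a certificate problem.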


The log you shared identifies the TLS listener that observed the "failed" handshake attempt, which occurred in the router's transport listener. I'm assuming "failed" refers only to a client TLS failure, not a server TLS failure. My guess is that it relates to the configuration of the "link listeners," not the "edge listeners," but I'm not confident; it could be either, because the router only has those two TLS listener configurations and they're bound to the same TCP port by default, 3022/TCP.
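If you want to confirm how your chart rendered those two listeners, you can read the router's config from inside the pod. This is only a sketch under a couple of assumptions: the deployment name is a placeholder, the router image includes a shell, and the config mount path below is what recent chart versions use, so check the pod's volume mounts if it differs.

# print the rendered router config and look for the edge listener and link listener
# stanzas to see which addresses/ports each one binds
kubectl exec -n ziti-router deploy/ziti-router -- sh -c 'cat /etc/ziti/config/*.yaml'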

Your best clues are the client IP addresses shown in the log. They're all within 172.25.0.0/16, suggesting a pod CIDR or other private subnet attached to your cluster.
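To put names to those addresses, you can ask the cluster which pod currently owns each client IP; pod IPs get recycled, so do this while the errors are still being logged. The address below is just the first one from your log.

# find the pod (if any) whose current pod IP matches a client address from the log
kubectl get pods -A -o wide --field-selector status.podIP=172.25.181.109

# if nothing matches, see whether the address belongs to a node or another subnet instead
kubectl get nodes -o wide

If the addresses map to something like kubelet probes or other health checking, that lines up with the benign explanations above.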