Routers intermittently lose controller connection (unable to ping, EOF, reconnects after 2–4 minutes) – OpenZiti 1.1.15 on GKE behind nginx ingress

Environment

OpenZiti version

  • Controller: 1.1.15

  • Routers: 1.1.15

Platform

  • Google Kubernetes Engine (GKE)

  • Node type: e2-standard-2

  • Controller and router are running as separate pods (same node in some cases)

Scale

  • ~200 identities

  • Multiple services and terminators

Problem Summary

We are seeing intermittent control-plane disconnections between routers and the controller.

At the time of the incident:

  • routers lose the controller connection

  • they reconnect automatically after ~1–2 minutes

  • during this window, clients experience service failures and the routers log:

service <id> has no online terminators

and

unable to ping (use of closed network connection)
rx error. closed peer and starting reconnection process
EOF

This happens even though:

  • CPU usage is low

  • memory usage is low

  • node conntrack usage is low

  • node file descriptor usage is low

We can provide Grafana metrics if needed.


Observed Router Logs

Example log lines from router:

"_context":"u{reconnecting}->i{KMeD} @tls:ziti-ctrl.zzz.zzz:443"
"msg":"unable to ping (use of closed network connection)"
"msg":"rx error. closed peer and starting reconnection process"
"error":"EOF"

After this, routers reconnect and services recover.

At the same time, routers log:

failed to dial fabric
service <service-id> has no online terminators
{"_context":"ch{edge}-\u003eu{classic}-\u003ei{QEmR}","chSeq":23175,"connId":126,"edgeSeq":0,"error":"service 3Zwpeo9QSKNGDkiqYLI6MX has no online terminators for instanceId ","file":"github.com/openziti/ziti/router/xgress_edge/listener.go:199","func":"github.com/openziti/ziti/router/xgress_edge.(*edgeClientConn).processConnect","level":"warning","msg":"failed to dial fabric","time":"2026-01-30T14:53:00.339Z","token":"2f6ccfa7-102d-4d57-86e4-3efff6910def","type":"EdgeConnectType"}
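During an incident we can inspect router and terminator state from the controller side with the ziti CLI (shown as a sketch; this assumes an admin session has already been established with `ziti edge login`):

```shell
# Correlate the "no online terminators" log lines with what the
# controller believes about router and terminator state.
ziti fabric list routers
ziti fabric list terminators
```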

Controller Connectivity

Routers connect to the controller using:

tls:ziti-ctrl.zzz.zzz:443

Internally the controller listens on:

tls:0.0.0.0:1280
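For clarity, the relevant portion of the router configuration pointing at the advertised controller address is shown below (a sketch; the actual file is generated by the ziti-router Helm chart and may differ in detail):

```yaml
# Router control-plane endpoint (sketch; chart-generated config may differ)
ctrl:
  endpoint: tls:ziti-ctrl.zzz.zzz:443
```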


Helm Values Used for Controller

# /tmp/controller-values.yml

ctrlPlane:
  advertisedHost: ziti-ctrl.zzz.zzz
  advertisedPort: 443
  service:
    type: ClusterIP
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations:
      kubernetes.io/ingress.allow-http: "false"
      nginx.ingress.kubernetes.io/ssl-passthrough: "true"
      nginx.ingress.kubernetes.io/secure-backends: "true"

clientApi:
  advertisedHost: ziti-controller.zzzz.zzz
  advertisedPort: 443
  service:
    type: ClusterIP
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations:
      kubernetes.io/ingress.allow-http: "false"
      nginx.ingress.kubernetes.io/ssl-passthrough: "true"
      nginx.ingress.kubernetes.io/secure-backends: "true"

Ingress is backed by the nginx ingress controller with SSL passthrough enabled.
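We confirmed that SSL passthrough is actually enabled on the ingress controller itself, since ingress-nginx silently ignores the `ssl-passthrough` annotation unless the controller process is started with `--enable-ssl-passthrough`. The check below is a sketch; the namespace and label assume a default ingress-nginx install and may differ in your cluster:

```shell
# Verify the ingress-nginx controller was started with
# --enable-ssl-passthrough (annotation is ignored otherwise).
# Namespace/label are assumptions; adjust for your install.
kubectl -n ingress-nginx get pods \
  -l app.kubernetes.io/name=ingress-nginx \
  -o jsonpath='{.items[0].spec.containers[0].args}' \
  | tr ',' '\n' | grep enable-ssl-passthrough
```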


Important Observations

  • The router is not crashing.

  • The controller is not crashing.

  • The connection is being closed and re-established.

  • Multiple services lose terminators at the same time.

  • The issue is reproducible and visible in metrics and logs.


Questions / Clarification Requested

  1. Could running the controller and router on the same node (but in different pods) contribute to this behavior, or is that unrelated?

  2. Are there any known issues or recommended settings for running OpenZiti 1.1.15 control-plane traffic behind an ingress?

  3. Do you recommend running the controller and router on separate nodes rather than the same node, and using a somewhat larger machine type? Is e2-standard-2 sufficient for this scale?


Additional Context

  • We are running ~200 identities.

  • We do not observe CPU, memory, conntrack, or file descriptor pressure at the node level.

  • We can provide Grafana screenshots and Kubernetes events if needed.

Is it OK to put the control plane behind nginx ingress with SSL passthrough, or could the ingress be terminating these connections intermittently?