OpenZiti 1.8.0-pre5 Control Channel Stuck in "reconnecting" State - Regression Bug
Critical Issue Summary
After upgrading from 1.6.12 to 1.8.0-pre5, the router-controller control channel is permanently stuck in the "reconnecting" state despite correct and verified network configuration, causing every circuit creation attempt to fail with a timeout.
This worked perfectly in 1.6.12 - this is a critical regression bug in 1.8.0-pre5.
Environment
- Previous Version: 1.6.12 (working correctly)
- Current Version: 1.8.0-pre5 (broken)
- Controller: openziti/ziti-controller:1.8.0-pre5
- Router: openziti/ziti-router:1.8.0-pre5
- Client SDK: ziti-sdk-c
- Deployment: Azure VMs, Docker containers
- Architecture: Single controller + single router on same Docker network
- Platforms Affected: macOS and Windows clients
The Bug: Control Channel Stuck in "reconnecting"
Router logs show control channel permanently in "reconnecting" state:
{
  "context": "ch{ctrl}->u{reconnecting}->i{NetFoundry Inc. Client s3n-fHzTu/nB58}",
  "error": "error creating route for [c/XXX]: timeout waiting for message reply: context deadline exceeded",
  "file": "github.com/openziti/ziti/router/handler_ctrl/route.go:139",
  "msg": "failure while handling route update"
}
Key observation: every single route request shows u{reconnecting} in its context; a stable connected state is never logged.
Frequency: 14+ occurrences per hour, affecting every circuit creation attempt.
Impact: Mass Circuit Creation Failures
217 circuit creation failures in 6 hours - 100% failure rate:
Controller Logs
{
  "_channels": ["selectPath"],
  "apiSessionId": "c6e8c16a-9189-498a-af23-7b81032618db",
  "attemptNumber": 2,
  "circuitId": "5Q9Vb5lZA2u6J7L29W2mos",
  "file": "github.com/openziti/ziti/controller/network/network.go:674",
  "level": "warning",
  "msg": "circuit creation failed after [2] attempts, sending cleanup unroutes",
  "serviceId": "4Ars44EiIHcFOO8t7GVAgQ",
  "serviceName": "Elitecom-Openziti-Web",
  "sessionId": "cmk6ngwhk2te001pc4olkcz6f"
}
{
  "attemptNumber": 1,
  "circuitId": "1VBo17d5yR88FkQsi8PgzU",
  "file": "github.com/openziti/ziti/controller/network/routesender.go:197",
  "level": "warning",
  "msg": "received failed route status from [r/8SGqhnW74C] for attempt [#0]",
  "error": "(error creating route for [c/XXX]: timeout waiting for message reply: context deadline exceeded)"
}
Router Logs
{
  "_channels": ["establishPath"],
  "apiSessionId": "c6e8c16a-9189-498a-af23-7b81032618db",
  "attempt": 0,
  "attemptNumber": "1",
  "binding": "edge",
  "circuitId": "1VBo17d5yR88FkQsi8PgzU",
  "context": "ch{ctrl}->u{reconnecting}->i{NetFoundry Inc. Client s3n-fHzTu/nB58}",
  "destination": "5gKN8j9cBnjjNS5ICeC2IM",
  "error": "error creating route for [c/XXX]: timeout waiting for message reply: context deadline exceeded",
  "file": "github.com/openziti/ziti/router/handler_ctrl/route.go:139",
  "level": "error",
  "msg": "failure while handling route update"
}
Terminator Cleanup
{
  "file": "github.com/openziti/ziti/router/xgress_edge/hosted.go:473",
  "level": "info",
  "msg": "terminator removed from router set",
  "reason": "received error from controller: timeout waiting for message reply: context deadline exceeded",
  "terminatorId": "<id>"
}
Client-Side Symptoms
ERROR connect_reply_cb() conn<service> failed to connect, reason=invalid session
DEBUG complete_conn_req() conn<service> Disconnected failed: connection is closed
Configuration Details
Router Configuration:

ctrl:
  endpoint: tls:openziti.itglobal.dk:443

Docker Network:
- Router and controller on same Docker network (ziti-controller_ziti)
- Internal connectivity verified
- No network isolation issues

Session Configuration:

api_session_enforcer:
  frequency: 5s
  sessionTimeout: 30m0s

Router Cache:

listeners:
  - binding: edge
    options:
      getSessionTimeout: 60
Database: Successfully migrated to version 44
Resources:
- Router: 5.43% CPU, 97MB RAM
- Controller: 19.31% CPU, 139MB RAM
Timing Pattern (Client Perspective)
Clients observe:
- Quick reconnect (< 60 sec): sometimes works (hits the timing window)
- Delayed reconnect (> 60 sec): always fails

This suggests the router's API session cache (60s getSessionTimeout) occasionally masks the underlying control channel issue; once the cache expires, the control channel problem becomes visible.
What We've Verified
- Network connectivity: router and controller on same Docker network
- TLS certificates: valid, no handshake failures
- Resources: CPU and memory normal
- Client tokens: fresh tokens after restart
- Database: migration successful to v44
- Configuration: controller endpoint correct
- No restarts: controller and router are stable, no crashes
The Regression
Version 1.6.12 Behavior (Working)
- Control channel stable
- Circuit creation: 100% success rate
- Clients reconnect automatically after server restart
- No "reconnecting" state observed
- No timeout errors
Version 1.8.0-pre5 Behavior (Broken)
- Control channel permanently in "reconnecting" state
- Circuit creation: 0% success rate
- Clients cannot establish connections
- Persistent "timeout waiting for message reply" errors
- Requires manual intervention (client restart)
Critical Questions for OpenZiti Team
1. Control Channel State Machine Bug
Question: Why is the control channel stuck in "reconnecting" state when network connectivity is verified and stable?
Code locations:
- github.com/openziti/ziti/router/handler_ctrl/route.go:139 - where timeouts occur
- Control channel context shows u{reconnecting} instead of a connected state
Hypothesis: a bug in the control channel reconnection logic in 1.8.0-pre5 prevents successful connection establishment, or keeps the state machine stuck in a reconnecting loop.
2. Message Reply Timeout
Question: What is the expected timeout for "waiting for message reply" during route creation?
Observation: Every single route request times out with "context deadline exceeded"
Code location: github.com/openziti/ziti/controller/network/routesender.go:197
3. Changes Between Versions
Question: What changed in control channel management between 1.6.12 and 1.8.0-pre5?
Specific areas of interest:
- Control channel connection/reconnection logic
- Message timeout handling
- Route creation protocol
- Circuit establishment flow
4. Known Issues
Question: Is this a known issue in 1.8.0-pre5? Are there any:
- Patches available (pre6, pre7, etc.)?
- Configuration workarounds?
- Debug logging we should enable?
5. Diagnostic Data Needed
Question: What additional information would help diagnose this?
We can provide:
- Full router logs with circuit failures
- Full controller logs
- Router/controller debug level logs
- Network packet captures
- Any other diagnostics requested
Request
This appears to be a critical regression in 1.8.0-pre5's control channel implementation. Please triage it as such and advise on any available patch or workaround; we are happy to supply the diagnostic data listed above.
Additional Context: We have TWO separate OpenZiti deployments (different Azure VMs), both experiencing identical symptoms after upgrading to 1.8.0-pre5. Both worked flawlessly on 1.6.12. This confirms it's not an environment-specific configuration issue but a regression in 1.8.0-pre5.