Environment:
- Controller: openziti/ziti-controller:2.0.0-pre6 (Docker)
- Router: openziti/ziti-router:2.0.0-pre6 (Docker)
- SDK clients: ziti-edge-tunnel v1.11.1 (Windows, latest stable) + ziti-sdk-c on Linux hosts
- Deployment: Single controller, single router, ~700 services/terminators
Problem:
After upgrading from v2.0.0-pre3 to v2.0.0-pre6, all terminators (~700) are created successfully on router startup, but are systematically deleted
approximately 12 minutes later, leaving only ~23 terminators. This cycle repeats on every router restart.
Root cause analysis:
Through log analysis, we identified the following sequence:
- Router starts and creates ~730 terminators as expected
- Router sends post-create inspect requests (message type 60799) to SDK clients for each terminator
- SDK clients (ziti-edge-tunnel v1.11.1 / ziti-sdk-c) advertise SupportsInspectHeader=true in their bind requests, but do not understand or handle message type 60799
- Router logs confirm the SDK clients drop the messages: "dropped message. type [60799]"
- After ~10 minutes + 3 retries (hardcoded in hostedServiceRegistry.evaluatePostCreateInspects), the router marks all terminators as state=3 (deleting) with reason: "post-create inspect timed out"
- Router batch-deletes all terminators via evaluateDeleteQueue / RemoveTerminatorsRateLimited
- Result: ~700 terminators deleted, services become unreachable
Relevant log entries:
Router (inspect timeout triggering deletion):
"msg":"post-create inspect: timed out waiting for response, closing terminator"
"msg":"updated state","newState":3,"oldState":2,"reason":"post-create inspect timed out"
Router (SDK clients dropping inspect messages):
"msg":"dropped message. type [60799], sequence [98], replyFor [97]"
Controller (router requesting batch deletions):
"msg":"removed terminators","routerId":"8SGqhnW74C","terminatorIds":["...50 IDs..."]
Timeline from logs (per restart cycle):
- T+0min: Router starts, ~730 terminators created
- T+12min: evaluatePostCreateInspects times out, all terminators moved to state=3
- T+12min: evaluateDeleteQueue batch-deletes all terminators
- T+12min+: Only ~23 terminators remain (those from the router's own built-in tunneler or non-inspect sources)
Impact:
All services hosted by SDK-based tunnelers become unreachable. The issue is self-perpetuating: restarting the router restores service for only ~12 minutes before the cycle repeats.
Workaround:
Rolling the router back to v2.0.0-pre3 resolves the issue, as pre3 does not include the post-create inspect feature.
Suggestion:
The post-create inspect feature appears to be incompatible with the current stable SDK release (ziti-edge-tunnel v1.11.1 / ziti-sdk-c 1.11.4). Possible
fixes:
- Make post-create inspect timeout behavior configurable (disable or extend timeout)
- Fall back gracefully when inspect response is not received (keep terminator alive instead of deleting)
- Ensure the stable SDK release supports the inspect message type before enabling it on the router side