Router v2.0.0-pre6 deletes all terminators after ~12 minutes due to post-create inspect timeout with stable SDK clients

Environment:

  • Controller: openziti/ziti-controller:2.0.0-pre6 (Docker)
  • Router: openziti/ziti-router:2.0.0-pre6 (Docker)
  • SDK clients: ziti-edge-tunnel v1.11.1 (Windows, latest stable) + ziti-sdk-c on Linux hosts
  • Deployment: Single controller, single router, ~700 services/terminators

Problem:

After upgrading from v2.0.0-pre3 to v2.0.0-pre6, all terminators (~700) are created successfully on router startup, but are systematically deleted
approximately 12 minutes later, leaving only ~23 terminators. This cycle repeats on every router restart.

Root cause analysis:

Through log analysis, we identified the following sequence:

  1. Router starts and creates ~730 terminators as expected
  2. Router sends post-create inspect requests (message type 60799) to SDK clients for each terminator
  3. SDK clients (ziti-edge-tunnel v1.11.1 / ziti-sdk-c) advertise SupportsInspectHeader=true in their bind requests, but do not understand or handle message
    type 60799
  4. Router logs confirm the SDK clients drop the messages: "dropped message. type [60799]"
  5. After ~10 minutes + 3 retries (hardcoded in hostedServiceRegistry.evaluatePostCreateInspects), the router marks all terminators as state=3 (deleting) with
    reason: "post-create inspect timed out"
  6. Router batch-deletes all terminators via evaluateDeleteQueue / RemoveTerminatorsRateLimited
  7. Result: ~700 terminators deleted, services become unreachable

Relevant log entries:

Router (inspect timeout triggering deletion):
"msg":"post-create inspect: timed out waiting for response, closing terminator"
"msg":"updated state","newState":3,"oldState":2,"reason":"post-create inspect timed out"

Router (SDK clients dropping inspect messages):
"msg":"dropped message. type [60799], sequence [98], replyFor [97]"

Controller (router requesting batch deletions):
"msg":"removed terminators","routerId":"8SGqhnW74C","terminatorIds":["...50 IDs..."]
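
To quantify how often each of these entries occurs in a restart cycle, a small script can tally them. This is a minimal sketch, assuming each log line is a single JSON object with a "msg" field (the excerpts above are fragments of such lines). The SAMPLE lines, the placeholder terminator IDs, and the counter names are illustrative and not taken from real output.

```python
import json
from collections import Counter

# Substrings copied from the log excerpts above.
SIGNATURES = {
    "inspect_timeout": "post-create inspect: timed out waiting for response",
    "dropped_inspect": "dropped message. type [60799]",
}


def tally(lines):
    """Count the log entries that match the signatures of this issue."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        msg = str(entry.get("msg", ""))
        for name, needle in SIGNATURES.items():
            if needle in msg:
                counts[name] += 1
        # State change to 3 (deleting) caused by the inspect timeout.
        if entry.get("newState") == 3 and entry.get("reason") == "post-create inspect timed out":
            counts["moved_to_deleting"] += 1
        # Controller-side batch removals: count the individual IDs.
        if msg == "removed terminators":
            counts["terminator_ids_removed"] += len(entry.get("terminatorIds") or [])
    return counts


# Illustrative input only; the terminator IDs are placeholders.
SAMPLE = [
    '{"msg":"post-create inspect: timed out waiting for response, closing terminator"}',
    '{"msg":"updated state","newState":3,"oldState":2,"reason":"post-create inspect timed out"}',
    '{"msg":"dropped message. type [60799], sequence [98], replyFor [97]"}',
    '{"msg":"removed terminators","routerId":"8SGqhnW74C","terminatorIds":["id-1","id-2"]}',
    "plain text line that is ignored",
]

counts = tally(SAMPLE)
for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```

For real logs, pass an open file object instead of SAMPLE, for example tally(open(path)), where path points to a router or controller log saved in JSON format.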

Timeline from logs (per restart cycle):

  • T+0min: Router starts, ~730 terminators created
  • T+12min: evaluatePostCreateInspects times out, all terminators moved to state=3
  • T+12min: evaluateDeleteQueue batch-deletes all terminators
  • T+12min+: Only ~23 terminators remain (those from the router's own built-in tunneler or non-inspect sources)

Impact:

All services hosted by SDK-based tunnelers become unreachable. The issue is self-perpetuating — restarting the router temporarily restores service for ~12
minutes before the cycle repeats.

Workaround:

Rolling the router back to v2.0.0-pre3 resolves the issue, as pre3 does not include the post-create inspect feature.

Suggestion:

The post-create inspect feature appears to be incompatible with the current stable SDK release (ziti-edge-tunnel v1.11.1 / ziti-sdk-c 1.11.4). Possible
fixes:

  • Make post-create inspect timeout behavior configurable (disable or extend timeout)
  • Fall back gracefully when inspect response is not received (keep terminator alive instead of deleting)
  • Ensure the stable SDK release supports the inspect message type before enabling it on the router side
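
The first two suggestions can be illustrated with a small decision model. This is only a sketch of the proposed behavior, not the router's actual code: InspectPolicy, OnTimeout, State, evaluate, and all field names are hypothetical, and the defaults (600 seconds, 3 retries) simply mirror the values observed in the logs.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACTIVE = "active"      # hypothetical name for a healthy terminator
    DELETING = "deleting"  # corresponds to state=3 in the router logs


class OnTimeout(Enum):
    DELETE = "delete"  # behavior observed with pre6
    KEEP = "keep"      # suggested graceful fallback


@dataclass
class InspectPolicy:
    enabled: bool = True        # allows disabling the check entirely
    timeout_s: float = 600.0    # assumed from the ~10 minutes observed
    max_retries: int = 3        # assumed from the 3 retries observed
    on_timeout: OnTimeout = OnTimeout.KEEP


def evaluate(policy, elapsed_s, retries, got_response):
    """Return the terminator state after one post-create inspect evaluation."""
    if not policy.enabled or got_response:
        return State.ACTIVE
    timed_out = elapsed_s >= policy.timeout_s and retries >= policy.max_retries
    if timed_out and policy.on_timeout is OnTimeout.DELETE:
        return State.DELETING
    # Either still waiting, or timed out under the KEEP policy.
    return State.ACTIVE


# An SDK client that never answers, evaluated 12 minutes after creation.
pre6_like = InspectPolicy(on_timeout=OnTimeout.DELETE)
graceful = InspectPolicy(on_timeout=OnTimeout.KEEP)
print(evaluate(pre6_like, 720, 3, False))
print(evaluate(graceful, 720, 3, False))
```

With the KEEP policy, a client that silently drops message type 60799 keeps its terminator, so older SDK releases would continue to host services while newer ones still benefit from the inspect check.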

Thank you for the bug report. Very clear and helpful! I'll work with the c-sdk folks to make sure we resolve this in a -pre7 release.

Cheers,
Paul

I think we've resolved the issue. -pre7 is tagged and building and will be available shortly.

Thank you again for testing and reporting issues, much appreciated.
Paul

That is great, thanks a lot. I will test the new pre-release. I have a feeling we are getting close to the final release of 2.0, and that is great. Thanks for your efforts building this great project.
