Router v2.0.0-pre6 deletes all terminators after ~12 minutes due to post-create inspect timeout with stable SDK clients

Environment:

  • Controller: openziti/ziti-controller:2.0.0-pre6 (Docker)
  • Router: openziti/ziti-router:2.0.0-pre6 (Docker)
  • SDK clients: ziti-edge-tunnel v1.11.1 (Windows, latest stable) + ziti-sdk-c on Linux hosts
  • Deployment: Single controller, single router, ~700 services/terminators

Problem:

After upgrading from v2.0.0-pre3 to v2.0.0-pre6, all terminators (~730) are created successfully on router startup, but ~700 of them are systematically deleted
approximately 12 minutes later, leaving only ~23 terminators. This cycle repeats on every router restart.

Root cause analysis:

Through log analysis, we identified the following sequence:

  1. Router starts and creates ~730 terminators as expected
  2. Router sends post-create inspect requests (message type 60799) to SDK clients for each terminator
  3. SDK clients (ziti-edge-tunnel v1.11.1 / ziti-sdk-c) advertise SupportsInspectHeader=true in their bind requests, but do not understand or handle message
    type 60799
  4. Router logs confirm the SDK clients drop the messages: "dropped message. type [60799]"
  5. After ~10 minutes + 3 retries (hardcoded in hostedServiceRegistry.evaluatePostCreateInspects), the router marks all terminators as state=3 (deleting) with
    reason: "post-create inspect timed out"
  6. Router batch-deletes all terminators via evaluateDeleteQueue / RemoveTerminatorsRateLimited
  7. Result: ~700 terminators deleted, services become unreachable
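
The sequence above can be modelled as a small state machine. The following is a toy Python sketch, not the router's actual Go code: the class, field, and function names, the ESTABLISHED label for state 2, and the idea of counting "passes" instead of wall-clock time are assumptions made for illustration. Only state 3 (deleting), the three retries, and the timeout reason string come from the logs.

```python
from dataclasses import dataclass

# State numbers as seen in the router log ("oldState":2, "newState":3).
ESTABLISHED = 2  # assumed label for state 2
DELETING = 3     # described as "deleting" in the report

MAX_RETRIES = 3  # "3 retries" per the report


@dataclass
class Terminator:
    id: str
    client_handles_inspect: bool  # False for clients that drop type 60799
    state: int = ESTABLISHED
    inspect_attempts: int = 0
    inspect_answered: bool = False
    reason: str = ""


def evaluate_post_create_inspect(t: Terminator) -> None:
    """One evaluation pass: accept a reply, retry, or give up and mark for deletion."""
    if t.state != ESTABLISHED or t.inspect_answered:
        return
    if t.client_handles_inspect:
        t.inspect_answered = True
        return
    # The client dropped the inspect message, so no reply ever arrives.
    if t.inspect_attempts < MAX_RETRIES:
        t.inspect_attempts += 1
        return
    t.state = DELETING
    t.reason = "post-create inspect timed out"


def run(terminators, passes):
    """Run several evaluation passes and return the terminators marked for deletion."""
    for _ in range(passes):
        for t in terminators:
            evaluate_post_create_inspect(t)
    return [t for t in terminators if t.state == DELETING]
```

In this model, a terminator whose client never answers is untouched for the first three passes and moves to state 3 on the fourth, which mirrors the "everything looks fine, then mass deletion" pattern seen here.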

Relevant log entries:

Router (inspect timeout triggering deletion):
"msg":"post-create inspect: timed out waiting for response, closing terminator"
"msg":"updated state","newState":3,"oldState":2,"reason":"post-create inspect timed out"

Router (SDK clients dropping inspect messages):
"msg":"dropped message. type [60799], sequence [98], replyFor [97]"

Controller (router requesting batch deletions):
"msg":"removed terminators","routerId":"8SGqhnW74C","terminatorIds":["...50 IDs..."]

Timeline from logs (per restart cycle):

  • T+0min: Router starts, ~730 terminators created
  • T+12min: evaluatePostCreateInspects times out, all terminators moved to state=3
  • T+12min: evaluateDeleteQueue batch-deletes all terminators
  • T+12min+: Only ~23 terminators remain (those from the router's own built-in tunneler or non-inspect sources)
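
For anyone wanting to check whether they are hitting the same problem, the log excerpts and timeline above can be cross-checked with a short script. This is a rough sketch that assumes the router and controller write one JSON object per line (the excerpts show only individual fields, so the full line format is an assumption); the function name and counter keys are made up for illustration.

```python
import json
from collections import Counter


def summarize(lines):
    """Count the three log signatures quoted in this report."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        msg = str(entry.get("msg", ""))
        if msg.startswith("dropped message. type [60799]"):
            counts["inspect_dropped"] += 1
        elif (msg == "updated state"
              and entry.get("reason") == "post-create inspect timed out"):
            counts["inspect_timed_out"] += 1
        elif msg == "removed terminators":
            # Controller side: each entry carries a batch of terminator IDs.
            counts["terminators_removed"] += len(entry.get("terminatorIds", []))
    return counts
```

Since a file object iterates line by line, a log file opened with `open()` can be passed to `summarize` directly. A large `inspect_timed_out` count roughly 12 minutes after startup would match the behavior described here.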

Impact:

All services hosted by SDK-based tunnelers become unreachable. The issue recurs on every restart: restarting the router restores service for only ~12
minutes before the cycle repeats.

Workaround:

Rolling the router back to v2.0.0-pre3 resolves the issue, as pre3 does not include the post-create inspect feature.

Suggestion:

The post-create inspect feature appears to be incompatible with the current stable SDK release (ziti-edge-tunnel v1.11.1 / ziti-sdk-c 1.11.4). Possible
fixes:

  • Make post-create inspect timeout behavior configurable (disable or extend timeout)
  • Fall back gracefully when inspect response is not received (keep terminator alive instead of deleting)
  • Ensure the stable SDK release supports the inspect message type before enabling it on the router side
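
To make the first two suggestions concrete, here is a minimal sketch of what a configurable timeout policy could look like. None of these option names exist in ziti-router as far as I know; `InspectPolicy`, `OnInspectTimeout`, and `should_delete` are hypothetical, and the 600-second default is only taken from the "~10 minutes" observed above.

```python
from dataclasses import dataclass
from enum import Enum


class OnInspectTimeout(Enum):
    DELETE = "delete"  # behavior observed in this report
    KEEP = "keep"      # graceful fallback: leave the terminator in place


@dataclass(frozen=True)
class InspectPolicy:
    # All fields are hypothetical settings, not real router configuration.
    enabled: bool = True
    timeout_seconds: float = 600.0
    on_timeout: OnInspectTimeout = OnInspectTimeout.DELETE


def should_delete(policy: InspectPolicy, seconds_waiting: float,
                  got_response: bool) -> bool:
    """Decide whether an unanswered post-create inspect should remove the terminator."""
    if not policy.enabled or got_response:
        return False
    if seconds_waiting < policy.timeout_seconds:
        return False  # still inside the (possibly extended) timeout window
    return policy.on_timeout is OnInspectTimeout.DELETE
```

With a policy like `InspectPolicy(on_timeout=OnInspectTimeout.KEEP)`, clients that silently drop the inspect message would keep their terminators, at the cost of losing whatever the inspect check was meant to catch.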

Thank you for the bug report. Very clear and helpful! I'll work with the c-sdk folks to make sure we resolve this in -pre7.

Cheers,
Paul

I think we've resolved the issue. -pre7 is tagged and building and will be available shortly.

Thank you again for testing and reporting issues, much appreciated.
Paul

That is great, thanks a lot! I will test the new pre-release. I have a feeling we are getting close to the final release of 2.0, and that is great. Thanks for your efforts building this great project.