Thank you for sending the stack dumps. I've dug through them, and they're not as useful as I was hoping. I do have a bit of a lead though, so I'm hoping you can answer a couple of questions to narrow things down:
Are you using legacy sessions or OIDC sessions?
Are you running in HA mode?
Are you doing anything unusual with cert management?
Based on the symptoms, it looks like your controller id is changing after a restart: the id returned by the channel no longer matches the original value.
That's why you're seeing the router data model messages every 30 seconds. The router looks up the controller it's connected to by the channel id, but doesn't find it in the map (which is keyed by the original id). It assumes that controller is gone and resubscribes (to the same controller).
That also explains the 'no controller available' on dial. If you're using legacy sessions, we stamp the api session with the source controller id, so we can be sure to ask that controller to dial. However, we stamp it with the controller id from the channel, and if that changes, we can't find the matching controller in the controllers map.
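To make the two symptoms concrete, here is a toy model of the lookup described above. It is only a sketch, not the actual router code: the Router class, its method names, and the ids "id-before" and "id-after" are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical stand-in for a router's view of its controllers."""

    # Keyed by the controller id seen when the channel was first set up.
    controllers: dict = field(default_factory=dict)

    def connect(self, channel_id: str) -> None:
        self.controllers[channel_id] = {"connected": True}

    def needs_resubscribe(self, current_channel_id: str) -> bool:
        # Periodic check: a miss is read as "that controller is gone".
        return current_channel_id not in self.controllers

    def dial(self, session_controller_id: str) -> str:
        # Legacy sessions carry the controller id taken from the channel.
        if session_controller_id not in self.controllers:
            return "no controller available"
        return "dial via " + session_controller_id


router = Router()
router.connect("id-before")  # id at initial connect (made up)

# While the id is stable, both paths work.
stable_dial = router.dial("id-before")
stable_check = router.needs_resubscribe("id-before")

# After the restart the channel reports a different id (made up),
# but the map is still keyed by the original one.
changed_dial = router.dial("id-after")
changed_check = router.needs_resubscribe("id-after")

print(stable_dial, stable_check)
print(changed_dial, changed_check)
```

In this model a changed id produces both the repeated resubscribe and the failed dial, which is why the two symptoms point at the same cause.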
We can try to verify this. If you run ziti fabric inspect router-controllers, it will show you the id in the map. You can run that before restarting a stuck controller, and then again afterwards, to see if the id has changed.
I'm not sure how the controller id would be changing; I've not seen that before. It's possible I'm misinterpreting the symptoms, but that's my current best guess.
Let me know if you can spot any changes in the ids and if you have thoughts on how they might be changing.
Standard Ziti PKI via bootstrap, no custom cert management (Let's Encrypt only for the edge API listener)
Regarding ziti fabric inspect router-controllers: The CLI in v1.6.14 fails with Error: &{[] []} (*rest_model.InspectResponse) is not supported by the TextConsumer, can be resolved by supporting TextUnmarshaler interface.
Tried all output formats (yaml, json, -j, --verbose) — same error. Is there another way to check the controller ID on this version?
I'm going to put in a fix so that the effective channel id of the router -> ctrl reconnecting channel can't change, which, if my diagnosis is correct, should fix your problem. I'd still like to know how it's happening, though. Do you have two different controllers running behind a load-balancer?
You could also try the latest 2.0.0 prerelease, which doesn't allow the id to change after initial setup.
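For readers following along, the idea of an id that "can't change" after initial setup can be pictured roughly like this. This is a guess at the general shape only, not the actual fix; the class and attribute names are hypothetical.

```python
class ReconnectingChannel:
    """Hypothetical channel wrapper that pins its id at first connect."""

    def __init__(self) -> None:
        self._effective_id = None  # set once, never overwritten
        self.underlay_id = None    # whatever the latest connection reports

    def on_connect(self, reported_id: str) -> None:
        self.underlay_id = reported_id
        if self._effective_id is None:
            self._effective_id = reported_id

    @property
    def id(self) -> str:
        # Map lookups use this value, so they keep matching the original key.
        return self._effective_id


channel = ReconnectingChannel()
channel.on_connect("id-before")  # made-up id at initial setup
channel.on_connect("id-after")   # made-up id reported after a reconnect
print(channel.id, channel.underlay_id)
```

With the effective id pinned, anything keyed by the original id would still resolve even if the peer reports a different id on reconnect.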
The command was failing with (*rest_model.InspectResponse) is not supported by the TextConsumer. Turned out to be a configuration issue: we had split web listeners.
The CLI authenticates to :1280, but the fabric binding was only on :443. After adding fabric to the :1280 listener, the command works:
UndXF-da4.router-controllers
controllers:
    NetFoundry Inc. Client Rrgu3b2z4:
        controllerId: NetFoundry Inc. Client Rrgu3b2z4
        connected: true
        version: v1.6.14

8IqWeohZl0.router-controllers
controllers:
    NetFoundry Inc. Client Rrgu3b2z4:
        controllerId: NetFoundry Inc. Client Rrgu3b2z4
        connected: true
        version: v1.6.14
After a controller restart (with fresh routers) the controllerId stays the same: NetFoundry Inc. Client Rrgu3b2z4. But the bug only triggers after 18+ hours of router uptime.
Will reproduce the stuck state tomorrow and check if the controllerId changes. Will post the before/after comparison here.
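The before/after comparison could be scripted along these lines. This is a rough sketch that assumes the inspect output keeps the shape pasted above (a "<routerId>.router-controllers" header followed by controllerId lines); the function names are made up, and the "after" snapshot is fabricated purely to exercise the comparison.

```python
SUFFIX = ".router-controllers"


def controller_ids(inspect_text: str) -> dict:
    """Map each router id to the set of controllerId values it reports."""
    result = {}
    router = None
    for raw in inspect_text.splitlines():
        line = raw.strip()  # indentation is ignored on purpose
        if line.endswith(SUFFIX):
            router = line[: -len(SUFFIX)]
            result[router] = set()
        elif line.startswith("controllerId:") and router is not None:
            result[router].add(line.split(":", 1)[1].strip())
    return result


def changed_routers(before: str, after: str) -> list:
    """Routers present in both snapshots whose controller ids differ."""
    b, a = controller_ids(before), controller_ids(after)
    return sorted(r for r in b if r in a and b[r] != a[r])


before = """UndXF-da4.router-controllers
controllers:
    NetFoundry Inc. Client Rrgu3b2z4:
        controllerId: NetFoundry Inc. Client Rrgu3b2z4
        connected: true
"""
# Fabricated "after" snapshot with a placeholder id, for illustration only.
after = before.replace("Rrgu3b2z4", "PLACEHOLDER")

print(controller_ids(before))
print(changed_routers(before, after))
```

Saving the command output to a file before and after the restart and feeding both into changed_routers would list any router that sees a different controllerId.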
Re: ZITI_BOOTSTRAP settings: Our controller service.env has all bootstrap flags set to true, plus ZITI_AUTO_RENEW_CERTS='true'. The comments say "unless it exists" so it shouldn't overwrite, but we'll verify tomorrow along with the controllerId check. If the ID changes after 18h+ uptime restart, we'll try setting bootstrap flags to false and see if that fixes it.
@msbusk Thank you — your suggestion about ZITI_BOOTSTRAP settings was spot on!
We confirmed that ZITI_BOOTSTRAP='true' was regenerating the controller identity on every restart, causing the controller ID to change each time:
Restart 1: Rrgu3b2z4
Restart 2: ft1kQWD3t
Restart 3: 3rhu6YIQi
After setting all ZITI_BOOTSTRAP_* and ZITI_AUTO_RENEW_CERTS to false, the controller ID stays stable across restarts and the routers no longer get stuck. Verified with 22+ hours router uptime.
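In case it helps anyone else check their own setup, a quick scan of service.env for the flags mentioned above could look like this. It is a minimal sketch: the function name is made up, the sample values are invented, and ZITI_BOOTSTRAP_EXAMPLE is a hypothetical name standing in for the ZITI_BOOTSTRAP_* family.

```python
def enabled_bootstrap_flags(env_text: str) -> list:
    """Return watched keys that are set to true in a KEY='value' env file."""
    flagged = []
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip("'\"").lower()
        watched = key.startswith("ZITI_BOOTSTRAP") or key == "ZITI_AUTO_RENEW_CERTS"
        if watched and value == "true":
            flagged.append(key)
    return flagged


# Invented excerpt; ZITI_BOOTSTRAP_EXAMPLE is a hypothetical variable name.
sample = """
# controller service.env (excerpt)
ZITI_BOOTSTRAP='true'
ZITI_BOOTSTRAP_EXAMPLE='false'
ZITI_AUTO_RENEW_CERTS='true'
"""

print(enabled_bootstrap_flags(sample))
```

An empty list would correspond to the configuration that kept the controller ID stable here.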