Stack dumps for GitHub issue #3769 (router stuck after controller restart)

collection_log.zip (184.2 KB)

Hi @plorenz,

As requested in #3769, here are stack dumps from all components collected during bug reproduction.

Setup: v1.6.14, 1 controller + 1 public router (same VPS), 1 dark router (LXC). Router uptime ~24h before test.

Dump phases:

  1. baseline — system healthy, monitors running OK

  2. after-controller-restart — controller restarted, existing SDK connections survived

  3. stuck — monitors restarted (new SDK context) → permanent no controller available

  4. fixed — routers restarted, everything recovered

36 dump files (3 per component per phase, 10s intervals) + collection log attached as zip.

Hi @galex-gh,

Thank you for sending the stack dumps. I've dug through them, and they're not as useful as I was hoping. I do have a bit of a lead though, so I'm hoping you can answer a couple of questions to narrow things down:

  1. Are you using legacy sessions or OIDC sessions?
  2. Are you running in HA mode?
  3. Are you doing anything unusual with cert management?

What it looks like, based on the symptoms, is that your controller id seems to be changing after a restart. The symptoms indicate that the id returned by the channel no longer matches the original value.

That's why you're seeing the router data model messages every 30 seconds. It looks up the controller it's connected to by the channel id, but doesn't find it in the map (which is keyed by the original id). It assumes that controller is gone and resubscribes (to the same controller).

That also explains the 'no controller available' on dial. If you're using legacy sessions, we stamp the api session with the source controller id, so we can be sure to ask that controller to dial. However, we stamp it with the controller id from the channel, and if that changes, we can't find the matching controller in the controllers map.

We can try to verify this. If you run ziti fabric inspect router-controllers, it will show you the id in the map. You can run that before restarting a stuck controller, and then again after, and see if the id has changed.
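If comparing the two outputs by eye is awkward, a small helper can do it. The Go sketch below assumes the text output contains one `controllerId:` line per controller, as in the v1.6.14 output pasted in this thread. The function names `controllerIDs` and `changed` are hypothetical, and the "after" sample with id `some-other-id` is made up for illustration, not real output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// controllerIDs returns the value of every "controllerId:" line found in the
// text output of the inspect command, in the order they appear.
func controllerIDs(inspectOutput string) []string {
	var ids []string
	sc := bufio.NewScanner(strings.NewReader(inspectOutput))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "controllerId:") {
			ids = append(ids, strings.TrimSpace(strings.TrimPrefix(line, "controllerId:")))
		}
	}
	return ids
}

// changed reports whether two id lists differ in length or content.
func changed(before, after []string) bool {
	if len(before) != len(after) {
		return true
	}
	for i := range before {
		if before[i] != after[i] {
			return true
		}
	}
	return false
}

func main() {
	// Shape taken from the output shown in this thread.
	before := "controllers:\n  ctrl_client:\n    controllerId: ctrl_client\n"
	// Hypothetical output after a restart, for illustration only.
	after := "controllers:\n  ctrl_client:\n    controllerId: some-other-id\n"

	fmt.Println("id changed:", changed(controllerIDs(before), controllerIDs(after)))
}
```

In practice the two strings would come from saving the command output to files before and after the restart and reading them in.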

I'm not sure how the controller id would be changing, I've not seen that before. It's possible I'm misinterpreting the symptoms, but that's my current best guess.

Let me know if you can spot any changes in the ids and if you have thoughts on how they might be changing.

Thank you,
Paul

Hi @plorenz,

Answers to your questions:

  1. Legacy sessions (no OIDC configured)

  2. No HA — single controller

  3. Standard Ziti PKI via bootstrap, no custom cert management (Let's Encrypt only for the edge API listener)

Regarding ziti fabric inspect router-controllers: The CLI in v1.6.14 fails with
Error: &{[] []} (*rest_model.InspectResponse) is not supported by the TextConsumer, can be resolved by supporting TextUnmarshaler interface.
Tried all output formats (yaml, json, -j, --verbose) — same error. Is there another way to check the controller ID on this version?

Are you running version 1.6.14 everywhere, including the CLI?

$ ziti-1.6.14 fabric inspect router-controllers
Results: (1)
txeMjStpgD.router-controllers
controllers:
  ctrl_client:
    address: tls:localhost:6262
    connected: true
    controllerId: ctrl_client
    isLeader: false
    latency: 641.008µs
    responsive: true
    timeSinceLastContact: 7.22s
    version: v1.6.14
$ ziti-1.6.14 fabric list routers
╭────────────┬───────────────┬────────┬──────┬──────────────┬──────────┬────────────────────────┬───────────╮
│ ID         │ NAME          │ ONLINE │ COST │ NO TRAVERSAL │ DISABLED │ VERSION                │ LISTENERS │
├────────────┼───────────────┼────────┼──────┼──────────────┼──────────┼────────────────────────┼───────────┤
│ txeMjStpgD │ edge-router-1 │ true   │    0 │ false        │ false    │ v1.6.14 on linux/amd64 │           │
╰────────────┴───────────────┴────────┴──────┴──────────────┴──────────┴────────────────────────┴───────────╯
results: 1-1 of 1
$

I'm going to put in a fix so that the effective channel id of the router -> ctrl reconnecting channel can't change, which, if my diagnosis is correct, should fix your problem. I'd still like to know how it's happening, though. Do you have two different controllers running behind a load balancer?

You could also try the latest 2.0.0 prerelease, which doesn't allow the id to change after initial setup.

Let me know,
Thank you,
Paul