Stack dumps for GitHub issue #3769 (router stuck after controller restart)

galex-gh · April 6, 2026, 10:02am

collection_log.zip (184.2 KB)

Hi @plorenz,

As requested in #3769, here are stack dumps from all components collected during bug reproduction.

Setup: v1.6.14, 1 controller + 1 public router (same VPS), 1 dark router (LXC). Router uptime ~24h before test.

Dump phases:

baseline — system healthy, monitors running OK
after-controller-restart — controller restarted, existing SDK connections survived
stuck — monitors restarted (new SDK context) → permanent "no controller available"
fixed — routers restarted, everything recovered

36 dump files (3 per component per phase, 10s intervals) + collection log attached as zip.

plorenz · April 7, 2026, 4:18am

Hi @galex-gh,

Thank you for sending the stack dumps. I've dug through them, and they're not as useful as I was hoping. I do have a bit of a lead though, so I'm hoping you can answer a couple of questions to narrow things down:

Are you using legacy sessions or OIDC sessions?
Are you running in HA mode?
Are you doing anything unusual with cert management?

What it looks like, based on the symptoms, is that your controller id seems to be changing after a restart. The symptoms indicate that the id returned by the channel no longer matches the original value.

That's why you're seeing the router data model messages every 30 seconds. It looks up the controller it's connected to by the channel id, but doesn't find it in the map (which is keyed by the original id). It assumes that controller is gone and resubscribes (to the same controller).

That also explains the 'no controller available' on dial. If you're using legacy sessions, we stamp the api session with the source controller id, so we can be sure to ask that controller to dial. However, we stamp it with the controller id from the channel, and if that changes, we can't find the matching controller in the controllers map.
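
To make the failure mode concrete, here is a minimal sketch of the lookup described above. It is only an illustration: `Channel`, `Router`, and the ids `ctrl-A` and `ctrl-B` are hypothetical names, not the actual Ziti router types or values.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Hypothetical stand-in for the router -> controller channel."""
    reported_id: str  # id the channel currently reports for the controller


@dataclass
class Router:
    # Known controllers, keyed by the id seen when the channel first connected.
    controllers: dict = field(default_factory=dict)

    def connect(self, ch: Channel) -> None:
        self.controllers[ch.reported_id] = ch

    def lookup(self, ch: Channel):
        # The lookup uses whatever id the channel reports right now.
        return self.controllers.get(ch.reported_id)

    def dial(self, session_ctrl_id: str) -> str:
        # Legacy sessions are stamped with the source controller id.
        if session_ctrl_id not in self.controllers:
            return "no controller available"
        return "dial sent to " + session_ctrl_id


router = Router()
ch = Channel(reported_id="ctrl-A")
router.connect(ch)

# The controller restarts and the reconnected channel reports a new id.
ch.reported_id = "ctrl-B"
print(router.lookup(ch))      # None: the map is still keyed by "ctrl-A"
print(router.dial("ctrl-B"))  # no controller available
```

Both symptoms fall out of the same mismatch: the map entry still exists, but nothing asks for it under the old key any more.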

We can try and verify this. If you run ziti fabric inspect router-controllers it will show you the id in the map. You can run that before restarting a stuck controller, and then again after, and see if the id has changed.

I'm not sure how the controller id would be changing, I've not seen that before. It's possible I'm misinterpreting the symptoms, but that's my current best guess.

Let me know if you can spot any changes in the ids and if you have thoughts on how they might be changing.

Thank you,
Paul

galex-gh · April 7, 2026, 7:34am

Hi @plorenz,

Answers to your questions:

Legacy sessions (no OIDC configured)
No HA — single controller
Standard Ziti PKI via bootstrap, no custom cert management (Let's Encrypt only for the edge API listener)

Regarding ziti fabric inspect router-controllers: The CLI in v1.6.14 fails with
Error: &{[] []} (*rest_model.InspectResponse) is not supported by the TextConsumer, can be resolved by supporting TextUnmarshaler interface.
Tried all output formats (yaml, json, -j, --verbose) — same error. Is there another way to check the controller ID on this version?

plorenz · April 7, 2026, 3:09pm

Are you running version 1.6.14 everywhere, including the CLI?

$ ziti-1.6.14 fabric inspect router-controllers
Results: (1)
txeMjStpgD.router-controllers
controllers:
  ctrl_client:
    address: tls:localhost:6262
    connected: true
    controllerId: ctrl_client
    isLeader: false
    latency: 641.008µs
    responsive: true
    timeSinceLastContact: 7.22s
    version: v1.6.14
$ ziti-1.6.14 fabric list routers
╭────────────┬───────────────┬────────┬──────┬──────────────┬──────────┬────────────────────────┬───────────╮
│ ID         │ NAME          │ ONLINE │ COST │ NO TRAVERSAL │ DISABLED │ VERSION                │ LISTENERS │
├────────────┼───────────────┼────────┼──────┼──────────────┼──────────┼────────────────────────┼───────────┤
│ txeMjStpgD │ edge-router-1 │ true   │    0 │ false        │ false    │ v1.6.14 on linux/amd64 │           │
╰────────────┴───────────────┴────────┴──────┴──────────────┴──────────┴────────────────────────┴───────────╯
results: 1-1 of 1
$

I'm going to put in a fix so that the effective channel id of the router -> ctrl reconnecting channel can't change, which, if my diagnosis is correct, should fix your problem. I'd still like to know how it's happening, though. Do you have two different controllers running behind a load-balancer?
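
One way to picture an id that "can't change" is a wrapper that pins the id from the first connection and ignores whatever later reconnects report. The sketch below is an assumption about the general idea, not the actual fix; `ReconnectingChannel` and its members are hypothetical names.

```python
class ReconnectingChannel:
    """Hypothetical wrapper that remembers the id from the first connection."""

    def __init__(self, peer_id: str):
        self._pinned_id = peer_id    # fixed for the life of the channel
        self._underlay_id = peer_id  # what the current connection reports

    def reconnect(self, reported_id: str) -> None:
        # After a restart the underlying connection may report a different id.
        self._underlay_id = reported_id

    @property
    def id(self) -> str:
        # Callers always see the pinned id, so lookups keyed by it still hit.
        return self._pinned_id

    @property
    def id_drifted(self) -> bool:
        # Useful for logging when the reported id no longer matches.
        return self._underlay_id != self._pinned_id


chan = ReconnectingChannel("ctrl-A")
controllers = {chan.id: chan}

chan.reconnect("ctrl-B")
print(chan.id in controllers)  # True
print(chan.id_drifted)         # True
```

Pinning hides the mismatch from map lookups but does not explain why the id changed, which is why the drift is still exposed separately here.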

You could also try the latest 2.0.0 prerelease, which doesn't allow the id to change after initial setup.

Let me know,
Thank you,
Paul

msbusk · April 8, 2026, 7:37am

We had the same problem and found out it was caused by the following settings in the controller Docker Compose file being set to true (listed here with the false values we switched to):

ZITI_BOOTSTRAP=false — master bootstrap flag
ZITI_BOOTSTRAP_PKI=false — generation of PKI certificates
ZITI_BOOTSTRAP_CONFIG=false — generation of the config file
ZITI_BOOTSTRAP_DATABASE=false — initialization of the database
ZITI_BOOTSTRAP_CONSOLE=false — setup of the ZAC console

Try changing them to false in your Docker Compose file, and then I don’t think your controller will keep getting a new Controller ID anymore.

galex-gh · April 8, 2026, 9:50am

Re: ziti fabric inspect router-controllers:

The command was failing with (*rest_model.InspectResponse) is not supported by the TextConsumer. Turned out to be a configuration issue — we had split web listeners:

# Listener 1 (public, 0.0.0.0:443)
bindings: edge-client, fabric

# Listener 2 (internal Headscale IP, 100.64.x.x:1280)
bindings: edge-management, zac

CLI authenticates to :1280, but fabric binding was only on :443. After adding fabric to the :1280 listener, the command works:

 UndXF-da4.router-controllers
 controllers:
   NetFoundry Inc. Client Rrgu3b2z4:
     controllerId: NetFoundry Inc. Client Rrgu3b2z4
     connected: true
     version: v1.6.14
 
 8IqWeohZl0.router-controllers
 controllers:
   NetFoundry Inc. Client Rrgu3b2z4:
     controllerId: NetFoundry Inc. Client Rrgu3b2z4
     connected: true
     version: v1.6.14

After a controller restart (with fresh routers) the controllerId stays the same: NetFoundry Inc. Client Rrgu3b2z4. But the bug only triggers after 18+ hours of router uptime.

Will reproduce the stuck state tomorrow and check if the controllerId changes. Will post the before/after comparison here.
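
The before/after comparison could be scripted roughly as follows. This is a sketch that assumes the text output keeps the shape shown above (a `<routerId>.router-controllers` header followed by `controllerId:` lines); the function names are made up for illustration, and the changed id in the usage lines is a placeholder.

```python
import re


def controller_ids(inspect_text: str) -> dict:
    """Map each router id to the set of controllerId values listed under it."""
    result = {}
    router = None
    for raw in inspect_text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"(\S+)\.router-controllers", line)
        if header:
            router = header.group(1)
            result[router] = set()
        elif line.startswith("controllerId:") and router is not None:
            result[router].add(line.split(":", 1)[1].strip())
    return result


def changed_routers(before: str, after: str) -> dict:
    """Return {router id: (ids before, ids after)} for routers whose ids differ."""
    b, a = controller_ids(before), controller_ids(after)
    return {r: (b[r], a.get(r, set())) for r in b if b[r] != a.get(r, set())}


before = """UndXF-da4.router-controllers
controllers:
  NetFoundry Inc. Client Rrgu3b2z4:
    controllerId: NetFoundry Inc. Client Rrgu3b2z4
    connected: true
"""
# Placeholder id, only to show what a changed result looks like.
after = before.replace("Rrgu3b2z4", "EXAMPLEID")

print(changed_routers(before, before))  # {}
print(changed_routers(before, after))
```

Saving the command output to a file before the restart and again after it, then passing both texts to `changed_routers`, gives an empty dict when the id is stable.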

Re: ZITI_BOOTSTRAP settings: Our controller service.env has all bootstrap flags set to true, plus ZITI_AUTO_RENEW_CERTS='true'. The comments say "unless it exists" so it shouldn't overwrite, but we'll verify tomorrow along with the controllerId check. If the ID changes after 18h+ uptime restart, we'll try setting bootstrap flags to false and see if that fixes it.

galex-gh · April 10, 2026, 8:07am

@msbusk Thank you — your suggestion about ZITI_BOOTSTRAP settings was spot on!

We confirmed that ZITI_BOOTSTRAP='true' was regenerating the controller identity on every restart, causing the controller ID to change each time:

Restart 1: Rrgu3b2z4
Restart 2: ft1kQWD3t
Restart 3: 3rhu6YIQi

After setting all ZITI_BOOTSTRAP_* and ZITI_AUTO_RENEW_CERTS to false, the controller ID stays stable across restarts and the routers no longer get stuck. Verified with 22+ hours router uptime.
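
For anyone checking their own deployment, a small script along these lines could flag the variables in question. It is a sketch that assumes a plain `KEY=value` env file such as service.env, with optional quotes around the value; it does not parse the list form used inside a Compose `environment:` block, and `flags_still_true` is a made-up helper name.

```python
def flags_still_true(env_text: str) -> list:
    """Return the bootstrap/auto-renew variables that are still set to true."""
    hits = []
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        value = value.strip().strip("'\"").lower()
        watched = key.startswith("ZITI_BOOTSTRAP") or key == "ZITI_AUTO_RENEW_CERTS"
        if watched and value == "true":
            hits.append(key)
    return hits


sample = """
# controller service.env (example content)
ZITI_BOOTSTRAP='true'
ZITI_BOOTSTRAP_PKI=false
ZITI_AUTO_RENEW_CERTS="true"
"""
print(flags_still_true(sample))  # ['ZITI_BOOTSTRAP', 'ZITI_AUTO_RENEW_CERTS']
```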
