Router stability issues

I’ve built out a small test environment: a router/controller (1.6.8) co-located on a publicly accessible AWS instance, and a second private router (1.6.8) within an on-prem environment, with a series of edge tunnelers in various places for some testing (and thus far I’ve been pretty impressed).

I am running into what appears to be some stability issues in the router service, though. Over the course of a day of testing, the fabric appears to crash (endpoints dialing services are no longer able to connect to any of them until the router service on the public instance is restarted). I have seen evidence of issue #3291 and was thinking it may be related, but the volume of circuits being created/held open isn’t what I’d consider staggering (the most I’ve seen concurrently open is a few hundred). I’m also not seeing abnormally high resource utilization or service crashes at the time of the issue.

Looking for some guidance on troubleshooting (looking through the logs, I’m having a hard time determining what might be related and what is normal).

Hi @kragifel

Here's a quick doc with some guidance on gathering diagnostic information: How To Gather OpenZiti Diagnostics · openziti/ziti Wiki · GitHub

Since the router isn't dying, but is becoming unresponsive, gathering stack dumps is likely to be the most illuminating. If you use ziti agent stack, you can gather multiple stack dumps, maybe 30 seconds apart (a rough sketch of that is below). If you send them to me, I'll be happy to take a look and see if I can diagnose what's going on.
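Something like this would do it, run on the router host (just a sketch: the loop, filenames, and output redirection are my choice, and I believe ziti agent stack writes the dump to stdout, so adjust as needed):

for i in 1 2 3 4 5; do
  ziti agent stack > "stackdump.$(date +%Y%m%d.%H%M%S)"
  sleep 30
done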

CPU and/or memory pprof might also be useful, if the stacks aren't definitive.

If you're curious and want to look for yourself, I use openziti/goroutine-analyzer on GitHub (helps analyze goroutines; inspired by TDA for Java and goroutine-inspect for golang) to examine stack dumps, and go tool pprof -http :8080 <path to pprof file> to examine pprofs.
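For example, with a CPU profile saved locally (sketch only; the file name is a placeholder):

go tool pprof -http :8080 ./router-cpu.pprof

That serves an interactive view of the profile (graph, flame graph, top functions) in your browser.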

Let me know if you've got stuff to look at and/or other questions. Hopefully we can figure this out.

Thank you,
Paul

Hey Paul,

Appreciate the response. I took a look at collecting those stack dumps, but I seem to be running into an issue with the ziti agent stack command. I get the following:

ziti agent stack
error: no processes found matching filter, use 'ziti agent list' to list candidates

The ziti agent list command it suggests for identifying candidates gives me an empty list. Is there a preliminary step I’m missing?

Also, if it helps - the controller/router are built out on Rocky 9 using the instructions from here: Deploying on Linux | NetFoundry Documentation

Disregard - I believe I’ve had some success using ziti fabric inspect stackdump <router-id>. Will grab a few, roughly as sketched below.
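For reference, this is roughly what I'm running from a CLI session logged in to the controller (a sketch; my-private-router stands in for the actual router id, and redirecting to a timestamped file is just my approach - check ziti fabric inspect --help for built-in output options):

ziti fabric inspect stackdump my-private-router > "rtr-stackdump.$(date +%Y%m%d.%H%M%S)"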

stackdumps.zip (208.6 KB)

Took the following (attached) dumps while the network was operating as expected:
rtr-stackdump.20251016.122330 - rtr-stackdump.20251016.122630
Observed the issue at 12:40.
Took the following dumps while the issue was ongoing:
rtr-stackdump.20251016.124500 - rtr-stackdump.20251016.124600
Took the following dumps after a router process restart (service restored):
rtr-stackdump.20251016.124730 - rtr-stackdump.20251016.124800
Going to see about taking a peek at them with your analyzer as well.

It looks like you're hitting this bug: Goroutine pool with a min worker count of 1 can drop to 0 workers due to race condition · Issue #452 · openziti/foundation · GitHub, which is fixed in 1.6.9. I noticed I hadn't marked that release stable yet, but it's been running on our internal system for a few weeks now with no issues, so I've marked it stable now.
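If you installed via the Linux packages from that deployment doc, the update is probably just a package update plus a service restart - a sketch, assuming the standard openziti/openziti-router package and ziti-router systemd unit names (adjust to match your install):

sudo dnf update openziti openziti-router
sudo systemctl restart ziti-router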

Not that it's all that relevant now, but I'm guessing the reason ziti agent list and ziti agent stack weren't working is that the router process was running as a different user. Usually running those commands with sudo gives them the permissions they need to see the IPC pipe files.
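In other words, the same commands as before, just run with elevated permissions:

sudo ziti agent list
sudo ziti agent stack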

If updating to 1.6.9 doesn't resolve the issue, let me know.

Thank you,
Paul

Great! Appreciate your help. I have updated to 1.6.9 and will keep an eye on it.

When we say worker, are we talking about workers within the router, or routers within the network? Any other tuning I should look at to account for the lack of multiple routers in this POC?

A worker here is a goroutine in the router. There's a goroutine pool, and there was a race condition that allowed pools with a min worker count of 1 to drop to 0 workers if multiple goroutines decided to finish at the same time. Once the pool dropped to 0 goroutines, it never recovered.

Ah, I see. Thanks again for the help/info. Will monitor.

Just wanted to confirm - 1.6.9 appears to have resolved this issue.