Router stability issues

I’ve built out a small test environment: a router/controller (1.6.8) co-located on a publicly accessible AWS instance, and a second private router (1.6.8) within an on-prem environment, with a series of edge tunnelers in various places for some testing (and thus far I’ve been pretty impressed).

I am running into what appears to be some stability issues in the router service, though. Over the course of a day of testing, the fabric appears to crash (endpoints dialing services are no longer able to connect to any of them until the router service on the public instance is restarted). I have seen evidence of issue #3291 and was thinking it may be related, but the volume of circuits being created/held open isn’t what I’d consider staggering (the most I’ve seen concurrently open is a few hundred). I’m also not seeing abnormally high resource utilization or service crashes at the time of the issue.

Looking for some guidance on troubleshooting (looking through the logs, I’m having a hard time determining what might be related and what is normal).

Hi @kragifel

Here's a quick doc with some guidance on gathering diagnostic information: How To Gather OpenZiti Diagnostics · openziti/ziti Wiki · GitHub

Since the router isn't dying, but is becoming unresponsive, gathering stack dumps is likely to be the most illuminating. If you use ziti agent stack, you can gather multiple stack dumps, maybe 30 seconds apart (a rough sketch of that is below). If you send them to me, I'll be happy to take a look and see if I can diagnose what's going on.
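Something like this would do it, run on the router host (just a sketch: the loop, filenames, and output redirection are my choice, and I believe ziti agent stack writes the dump to stdout, so adjust as needed):

for i in 1 2 3 4 5; do
  ziti agent stack > "stackdump.$(date +%Y%m%d.%H%M%S)"
  sleep 30
done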

CPU and/or memory pprof might also be useful, if the stacks aren't definitive.

If you're curious and want to look for yourself, I use openziti/goroutine-analyzer on GitHub (helps analyze goroutines; inspired by TDA for Java and goroutine-inspect for golang) to examine stack dumps, and go tool pprof -http :8080 <path to pprof file> to examine pprofs.
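For example, with a CPU profile saved locally (sketch only; the file name is a placeholder):

go tool pprof -http :8080 ./router-cpu.pprof

That serves an interactive view of the profile (graph, flame graph, top functions) in your browser.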

Let me know if you've got stuff to look at and/or other questions. Hopefully we can figure this out.

Thank you,
Paul

Hey Paul,

Appreciate the response. I took a look at collecting those stack dumps, but I seem to be running into an issue with the ziti agent stack command. I get the following:

ziti agent stack
error: no processes found matching filter, use 'ziti agent list' to list candidates

The ziti agent list command it suggests for identifying candidates gives me an empty list. Is there a preliminary step I’m missing?

Also, if it helps - the controller/router are built out on Rocky 9 using the instructions from here: Deploying on Linux | NetFoundry Documentation

Disregard - I believe I’ve had some success using ziti fabric inspect stackdump <router-id>. Will grab a few, roughly as sketched below.
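For reference, this is roughly what I'm running from a CLI session logged in to the controller (a sketch; my-private-router stands in for the actual router id, and redirecting to a timestamped file is just my approach - check ziti fabric inspect --help for built-in output options):

ziti fabric inspect stackdump my-private-router > "rtr-stackdump.$(date +%Y%m%d.%H%M%S)"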

stackdumps.zip (208.6 KB)

Took the following (attached) dumps while the network was operating as expected:
rtr-stackdump.20251016.122330 - rtr-stackdump.20251016.122630
Observed the issue at 12:40.
Took the following dumps while the issue was ongoing:
rtr-stackdump.20251016.124500 - rtr-stackdump.20251016.124600
Took the following dumps after a router process restart (service restored):
rtr-stackdump.20251016.124730 - rtr-stackdump.20251016.124800
Going to see about taking a peek at them with your analyzer as well.

It looks like you're hitting this bug: Goroutine pool with a min worker count of 1 can drop to 0 workers due to race condition · Issue #452 · openziti/foundation · GitHub, which is fixed in 1.6.9. I noticed I hadn't marked that release stable yet, but it's been running on our internal system for a few weeks now with no issues, so I've marked it stable now.
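If you installed via the Linux packages from that deployment doc, the update is probably just a package update plus a service restart - a sketch, assuming the standard openziti/openziti-router package and ziti-router systemd unit names (adjust to match your install):

sudo dnf update openziti openziti-router
sudo systemctl restart ziti-router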

Not that it's all that relevant now, but I'm guessing the reason ziti agent list and ziti agent stack weren't working is that the router process was running as a different user. Usually running those commands with sudo gives them the permissions they need to see the IPC pipe files.
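In other words, the same commands as before, just run with elevated permissions:

sudo ziti agent list
sudo ziti agent stack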

If updating to 1.6.9 doesn't resolve the issue, let me know.

Thank you,
Paul

Great! Appreciate your help. I have updated to 1.6.9 and will keep an eye on it.

When we say worker, are we talking about workers within the router, or routers within the network? Any other tuning I should look at to account for the lack of multiple routers in this POC?

A worker here is a goroutine in the router. There's a goroutine pool, and there was a race condition that allowed pools with a min worker count of 1 to drop to 0 workers if multiple goroutines decided to finish at the same time. Once the pool dropped to 0 goroutines, it never recovered.

Ah, I see. Thanks again for the help/info. Will monitor.

Just wanted to confirm - 1.6.9 appears to have resolved this issue.