Multi-router setup dropping circuits under load

Hi!

I’m using OpenZiti to secure connections between devices across my home and cloud environments. One of the protected services is Frigate (for my video camera system).

Right now, this setup is in my home lab, where I’m testing and evaluating it for a new environment in the near future. The goal is to validate the architecture, stability, and tuning under real workloads so we can use it permanently both at home and in production.

Here’s the current layout (diagram attached):

  • AWS
    • Controller
    • EdgeRouter (handles public internet traffic using #location:roaming)
  • Home
    • EdgeRouter (handles local traffic using #location:home)
    • Devices: Linux Desktop (multi-service: SSH, Minecraft, Raspberry Pi), Linux Server (multi-service: Frigate, SSH), Laptop at home
  • Remote Devices
    • Phone
    • Laptop (away from home)
    • Devices
    • Friend’s laptop

Everything works great, mostly. I can move traffic in all of the right directions, and things are connecting to routers.

However, when I test Frigate from the office (Office Desktop → router-aws → router-home → Frigate), things start off fine, but once I start clicking around a bit, everything stalls out and breaks. It recovers after a short while, but the hiccup is concerning, and it is very repeatable.

I see a bunch of errors when this happens, and I am not sure if things are set up wrong.

The controller is logging errors like this:

Oct 22 14:24:58 ziti1 ziti[2891]: {"file":"github.com/openziti/ziti/controller/network/fault.go:32","func":"github.com/openziti/ziti/controller/network.(*Network).fault","level":"info","msg":"network fault processing for [4] circuits","time":"2025-10-22T14:24:58.109Z"}
Oct 22 14:24:58 ziti1 ziti[2891]: {"circuitId":"2Ea3n5Kf8urZXrzC9XhwZk","file":"github.com/openziti/ziti/controller/network/fault.go:49","func":"github.com/openziti/ziti/controller/network.(*Network).fault","level":"info","msg":"sent unroute for circuit to router in response to forwarding fault","routerId":"4lZ1.ZHAH.","time":"2025-10-22T14:24:58.109Z"}
Oct 22 14:24:58 ziti1 ziti[2891]: {"circuitId":"5et5DY0A53SLxQjkYih7ko","file":"github.com/openziti/ziti/controller/network/fault.go:49","func":"github.com/openziti/ziti/controller/network.(*Network).fault","level":"info","msg":"sent unroute for circuit to router in response to forwarding fault","routerId":"4lZ1.ZHAH.","time":"2025-10-22T14:24:58.110Z"}
Oct 22 14:24:58 ziti1 ziti[2891]: {"circuitId":"4OjU0f3BmABmMSxWFErRVv","file":"github.com/openziti/ziti/controller/network/fault.go:49","func":"github.com/openziti/ziti/controller/network.(*Network).fault","level":"info","msg":"sent unroute for circuit to router in response to forwarding fault","routerId":"4lZ1.ZHAH.","time":"2025-10-22T14:24:58.110Z"}
Oct 22 14:24:58 ziti1 ziti[2891]: {"circuitId":"1iFIV96F51ERKdK4oudCry","file":"github.com/openziti/ziti/controller/network/fault.go:49","func":"github.com/openziti/ziti/controller/network.(*Network).fault","level":"info","msg":"sent unroute for circuit to router in response to forwarding fault","routerId":"4lZ1.ZHAH.","time":"2025-10-22T14:24:58.110Z"}

When this happens, the services hosted by the identity (the Frigate service and another SSH service) become unresponsive.

I see errors like this in the router at home:

Oct 22 14:24:56 meerkat1 ziti[4013743]: {"_context":"{c/1iFIV96F51ERKdK4oudCry|@/7BfglkaIFTqIarmgEL8iLA}\u003cTerminator\u003e","circuitId":"1iFIV96F51ERKdK4oudCry","error":"cannot forward payload, no forward table for circuit=1iFIV96F51ERKdK4oudCry src=7BfglkaIFTqIarmgEL8iLA","file":"github.com/openziti/ziti/router/handler_xgress/data_plane.go:58","func":"github.com/openziti/ziti/router/handler_xgress.(*dataPlaneAdapter).ForwardPayload","level":"error","msg":"unable to forward payload","origin":1,"seq":0,"time":"2025-10-22T14:24:56.420Z"}
Oct 22 14:24:56 meerkat1 ziti[4013743]: {"_channels":["establishPath"],"apiSessionId":"cmh21o3nv0xsx8beyo4m4qkqq","attempt":1,"attemptNumber":"2","binding":"edge","circuitId":"1iFIV96F51ERKdK4oudCry","context":"ch{ctrl}-\u003eu{reconnecting}-\u003ei{NetFoundry Inc. Client XIh5pyCvC/Vd3m}","destination":"4X9Js9SONGyIzIzERfZnsM","error":"error creating route for [c/1iFIV96F51ERKdK4oudCry]: timeout waiting for message reply: context deadline exceeded","file":"github.com/openziti/ziti/router/handler_ctrl/route.go:140","func":"github.com/openziti/ziti/router/handler_ctrl.(*routeHandler).fail","level":"error","msg":"failure while handling route update","serviceId":"4eAH77XoP3xgbu0hOwFU4m","sessionId":"cmh22ntcd0z3c8bey7xihtsnk","time":"2025-10-22T14:24:56.420Z"}
Oct 22 14:24:56 meerkat1 ziti[4013743]: {"circuitId":"1GVSFgr6Ce4MgWL8oGuOXx","ctrlId":"NetFoundry Inc. Client XIh5pyCvC","file":"github.com/openziti/ziti/router/forwarder/scanner.go:85","func":"github.com/openziti/ziti/router/forwarder.(*Scanner).scan","idleThreshold":60000000000,"idleTime":139886000000,"level":"warning","msg":"circuit exceeds idle threshold","time":"2025-10-22T14:24:56.924Z"}
Oct 22 14:24:56 meerkat1 ziti[4013743]: {"circuitId":"Ej6BcQ3iZ3I4HcuPnYFSL","ctrlId":"NetFoundry Inc. Client XIh5pyCvC","file":"github.com/openziti/ziti/router/forwarder/scanner.go:85","func":"github.com/openziti/ziti/router/forwarder.(*Scanner).scan","idleThreshold":60000000000,"idleTime":137073000000,"level":"warning","msg":"circuit exceeds idle threshold","time":"2025-10-22T14:24:56.924Z"}
Oct 22 14:24:56 meerkat1 ziti[4013743]: {"circuitId":"5uHE1PB7tj4bloN9gE0DYd","ctrlId":"NetFoundry Inc. Client XIh5pyCvC","file":"github.com/openziti/ziti/router/forwarder/scanner.go:85","func":"github.com/openziti/ziti/router/forwarder.(*Scanner).scan","idleThreshold":60000000000,"idleTime":131553000000,"level":"warning","msg":"circuit exceeds idle threshold","time":"2025-10-22T14:24:56.924Z"}
Oct 22 14:24:56 meerkat1 ziti[4013743]: {"circuitCount":3,"ctrlId":"NetFoundry Inc. Client XIh5pyCvC","file":"github.com/openziti/ziti/router/forwarder/scanner.go:100","func":"github.com/openziti/ziti/router/forwarder.(*Scanner).scan","level":"warning","msg":"sent confirmation for circuits","time":"2025-10-22T14:24:56.924Z"}
Oct 22 14:24:58 meerkat1 ziti[4013743]: {"circuitCount":4,"ctrlId":"NetFoundry Inc. Client XIh5pyCvC","file":"github.com/openziti/ziti/router/forwarder/faulter.go:107","func":"github.com/openziti/ziti/router/forwarder.(*Faulter).run","level":"warning","msg":"reported forwarding faults","time":"2025-10-22T14:24:58.087Z"}
Oct 22 14:24:58 meerkat1 ziti[4013743]: {"circuitId":"2Ea3n5Kf8urZXrzC9XhwZk","file":"github.com/openziti/ziti/router/forwarder/forwarder.go:155","func":"github.com/openziti/ziti/router/forwarder.(*Forwarder).Unroute","level":"info","msg":"circuit unrouted","time":"2025-10-22T14:24:58.126Z"}
Oct 22 14:24:58 meerkat1 ziti[4013743]: {"circuitId":"5et5DY0A53SLxQjkYih7ko","file":"github.com/openziti/ziti/router/forwarder/forwarder.go:155","func":"github.com/openziti/ziti/router/forwarder.(*Forwarder).Unroute","level":"info","msg":"circuit unrouted","time":"2025-10-22T14:24:58.126Z"}
Oct 22 14:24:58 meerkat1 ziti[4013743]: {"circuitId":"4OjU0f3BmABmMSxWFErRVv","file":"github.com/openziti/ziti/router/forwarder/forwarder.go:155","func":"github.com/openziti/ziti/router/forwarder.(*Forwarder).Unroute","level":"info","msg":"circuit unrouted","time":"2025-10-22T14:24:58.126Z"}
Oct 22 14:24:58 meerkat1 ziti[4013743]: {"circuitId":"1iFIV96F51ERKdK4oudCry","file":"github.com/openziti/ziti/router/forwarder/forwarder.go:155","func":"github.com/openziti/ziti/router/forwarder.(*Forwarder).Unroute","level":"info","msg":"circuit unrouted","time":"2025-10-22T14:24:58.126Z"}

I also see a few errors in the cloud router, but that one is not as noisy:

Oct 22 14:24:51 ziti1 ziti[6992]: {"_context":"ch{edge}-\u003eu{classic}-\u003ei{ziti-sdk-c[1]@officemini-m4/koVv}","chSeq":653,"connId":77,"edgeSeq":0,"error":"exceeded maximum [2] retries creating circuit [c/2Ea3n5Kf8urZXrzC9XhwZk] (error creating route for [s/2Ea3n5Kf8urZXrzC9XhwZk] on [r/4lZ1.ZHAH.] (error creating route for [c/2Ea3n5Kf8urZXrzC9XhwZk]: timeout waiting for message reply: context deadline exceeded))","file":"github.com/openziti/ziti/router/xgress_edge/listener.go:1146","func":"github.com/openziti/ziti/router/xgress_edge.(*nonXgConnectHandler).FinishConnect","level":"warning","msg":"failed to dial fabric","time":"2025-10-22T14:24:51.405Z","token":"ccc6c1df-fe77-4a37-ae58-6fbac63870aa","type":"EdgeConnectType"}
Oct 22 14:24:56 ziti1 ziti[6992]: {"circuitId":"1iFIV96F51ERKdK4oudCry","file":"github.com/openziti/ziti/router/forwarder/forwarder.go:155","func":"github.com/openziti/ziti/router/forwarder.(*Forwarder).Unroute","level":"info","msg":"circuit unrouted","time":"2025-10-22T14:24:56.440Z"}
Oct 22 14:24:56 ziti1 ziti[6992]: {"_context":"ch{edge}-\u003eu{classic}-\u003ei{ziti-sdk-c[1]@officemini-m4/koVv}","chSeq":692,"connId":78,"edgeSeq":0,"error":"exceeded maximum [2] retries creating circuit [c/1iFIV96F51ERKdK4oudCry] (error creating route for [s/1iFIV96F51ERKdK4oudCry] on [r/4lZ1.ZHAH.] (error creating route for [c/1iFIV96F51ERKdK4oudCry]: timeout waiting for message reply: context deadline exceeded))","file":"github.com/openziti/ziti/router/xgress_edge/listener.go:1146","func":"github.com/openziti/ziti/router/xgress_edge.(*nonXgConnectHandler).FinishConnect","level":"warning","msg":"failed to dial fabric","time":"2025-10-22T14:24:56.440Z","token":"ccc6c1df-fe77-4a37-ae58-6fbac63870aa","type":"EdgeConnectType"}

What I’m trying to figure out:

  • Are these caused by circuit exhaustion or route churn between routers?
  • Do I need to adjust timeouts, retries, or circuit lifetimes for multi-router setups?
  • Is Frigate’s “bursty” behavior (multiple parallel connections) overloading circuit creation or route negotiation?
  • Am I missing some kind of tuning parameters?
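
In case it helps, I can also capture the controller’s view of the circuits while reproducing this (I’m assuming this is the right command to watch circuits come and go):

ziti fabric list circuits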

Edit: Services on the “Linux Server” identity where Frigate is running are what become unresponsive (like ssh; an existing ssh session will just lock up). Other services using the home router continue working fine.

Edit2: If I change the attributes for “Linux Server” to use the cloud router, the issue goes away. Something is going on when traffic is transiting between two routers. The video traffic involves larger payloads (mostly .m3u8 playlists and .ts segments, and the .ts files in particular are fairly large).
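
For reference, the change in Edit2 was roughly this (the attribute here is a placeholder for whatever my edge router policies actually match):

ziti edge update identity "Linux Server" -a roaming-clients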

I appreciate any help!

Thanks,
Kris

Hi @Krishopper

My first thought was that this was related to the SDK <-> Router bottleneck, which comes into play when you've got multiple circuits going through the SDK. The reason I say this is that it looks like you're getting dial timeouts. If your underlay is generally capable and you're not experiencing temporary network issues, then dial timeouts are usually caused by existing circuits slowing down the control messaging that is working to set up new circuits.

Starting with the background: Historically, flow control/retransmission was handled in the routers. In OpenZiti the component that handles this is called xgress (for ingress/egress).

So you'd have TCP managing flow control/retransmission from the SDK to the routers, then xgress from initiating router to terminating router, then potentially another leg with TCP to another SDK.

If you've got multiple circuits from the SDK to the router, or you're trying to set up new circuits while there are existing busy circuits, the SDKs can run into problems. You can end up with something like

sdk -> dial response -> data payloads for existing circuits -> router with back-pressured xgress.

If you've got a back-pressured receiver in the router, the TCP connection will also back-pressure and traffic from the SDK to the router can halt until the back-pressure is relieved.

Even without a slow circuit causing back-pressure, if you're doing high bandwidth transfers, control data can get stuck behind data for other circuits just long enough to cause a timeout.
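
To make that concrete, here's a toy sketch in plain Go (not OpenZiti code, just a model of the head-of-line blocking): one ordered link is shared by bulk data payloads and a control reply, and a slow receiver back-pressures the link. The control reply isn't lost, it's just late enough to miss its deadline.

package main

import (
    "fmt"
    "time"
)

// Toy model: one ordered link carries both data payloads and control messages.
// A slow receiver back-pressures the link, so a control reply queued behind
// bulk data can exceed its deadline even though nothing is actually broken.
func main() {
    type msg struct{ kind, body string }

    link := make(chan msg, 64)

    // Queue a burst of data payloads for existing circuits, then the control
    // reply that a new circuit is waiting on. Ordering on the link is strict.
    for i := 0; i < 20; i++ {
        link <- msg{"data", fmt.Sprintf("payload-%d", i)}
    }
    link <- msg{"ctrl", "dial/route reply"}

    replied := make(chan struct{})

    // Slow receiver: drains the link in order at a limited rate.
    go func() {
        for m := range link {
            time.Sleep(100 * time.Millisecond)
            if m.kind == "ctrl" {
                close(replied)
            }
        }
    }()

    // The dialer only waits one second for the control reply.
    select {
    case <-replied:
        fmt.Println("circuit established")
    case <-time.After(time.Second):
        fmt.Println("timeout waiting for message reply (control stuck behind data)")
    }
}

The real system has windowing, retransmission and multiplexed channels, but that's the shape of the failure behind the "timeout waiting for message reply" errors in your logs.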

In cases where tunnelers are being used on the desktop or on a phone, they're generally serving only a few things at a time, so in practice this hasn't often been an issue. If you're using the SDK or a tunneler as a gateway or a proxy, serving lots of clients, the problem is much more visible.

To address this, we've done the following (so far):

  1. Allow using the router as a tunneler. This way there's no single TCP connection acting as a choke point. The xgress in the router allows us to drop payloads if there's a slow receiver anywhere on the path, since it will retransmit as needed. This ensures that a slow circuit won't cause problems for other circuits.
  2. Allow SDKs to set up a second connection to the router for control messaging. This ensures that control data doesn't get caught behind data payloads. Currently this is only supported by the Go SDK.
  3. Allow SDKs to manage the xgress in the SDK. This allows the router to drop payloads if something is slow. It also allows the SDK to set up multiple data connections, since xgress will also handle the payload re-ordering to fix the out-of-order payloads caused by using multiple connections. Also only currently supported by the Go SDK.

It is a little puzzling that removing one of the routers from the path eliminates the issue. Maybe with both xgress endpoints in the same router, the delay is small enough that the router only back-pressures briefly.

What are you currently using as endpoints? I would be curious if you could use the ER/T on the linux server and try using ziti tunnel tproxy on the linux desktop, as that will enable the sdk-flow-control and multi-channel functionality. If that fixes the issue, then my hypothesis is likely correct. If not, then we'll have to dig in more.

thank you,
Paul

Thank you for the in-depth response and the details. I’m still learning how the OpenZiti components work, and the information about xgress and how control traffic flows is very helpful.

“What are you currently using as endpoints?”

I am including a visual diagram, since my wording might have been confusing.

There are some Minecraft endpoints and such on the “Linux Desktop” machine, but I didn’t include them above as they’re not important for this conversation.

To clarify, I’ve only seen this issue with frigate so far. The connection from Home → AWS EC2 is 1000/40Mbit with about 40ms latency, so there is plenty of capacity and no packet loss.

“I would be curious if you could use the ER/T on the linux server and try using ziti tunnel tproxy on the linux desktop, as that will enable the sdk-flow-control and multi-channel functionality. If that fixes the issue, then my hypothesis is likely correct. If not, then we'll have to dig in more.”

I just changed the frigate bind policy to bind to ziti-router on Linux Server, and everything is working. Since they’re on the same machine, I am fine with that in this situation. But I would like to deploy OpenZiti to a new environment in the near future and want to learn as much as I can so I can effectively troubleshoot things like this when they happen. The last thing I want is a user to report that they cannot maintain a connection to something and then I have no idea where to look (which is another reason why your xgress/controller information is extremely valuable to me).
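
For reference, the bind policy change was roughly this (the policy and identity names are placeholders for my actual ones):

ziti edge update service-policy frigate-bind --identity-roles '@linux-server-router'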

Given that, and the visual diagram to help clear up the architecture a bit, it would be nice to understand why it falls apart when the service is bound to ziti-edge-tunnel on the Linux Server. Should I enable tproxy on the Linux Server, where Frigate is? Since I want to understand the components, I’m not clear why I would enable it on the Linux Desktop in this situation, so I want to clarify before testing further. 🙂

Thanks again, and I appreciate your time helping with this.

-Kris

Let's look at each side individually:

Server
If you have a router running on the server already and you don't need any ZET-specific features, then we generally recommend using the tunneler functionality built into the router. It simplifies the deployment architecture and usually provides better performance as well. Servers are usually only hosting services, so you don't need to use tproxy on the server.

Steps to enable:

  1. Ensure the router is configured as tunneler enabled in the model: ziti edge update edge-router <edge router name or id> -t
  2. Ensure the tunnel binding is enabled in the router configuration:
listeners:
  - binding: tunnel
    options:
      mode: host
  3. Routers which are tunneler-enabled have an associated identity. If you give that identity the same role attributes as you have for the ZET identity, it should start hosting the services.
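
For example, something like this (the identity name and attributes are placeholders; use the router's actual identity name from ziti edge list identities and whatever attributes your Bind service policies match):

ziti edge update identity <router identity name> -a 'frigate-hosts,ssh-hosts'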

You can check that it's working by checking the terminators for the services in question. They should have a terminator on that router with a binding of tunnel.

ziti fabric list terminators

Client
We generally don't recommend running the edge router/tunneler (ER/T) combo on clients, unless it's acting as a gateway/proxy for other systems, in which case it's more of a server than an end client.

What I was hoping you'd try is the Go tunneler. The ZET uses the C SDK and is our recommended tunneler. We do have a second tunneler, which is Go-based and works only on Linux using tproxy. It's what the ER/T uses as its tunneling component, since the router is also written in Go.

Currently the Go tunneler has a couple of features that are still under development in the C SDK, namely SDK-based flow control and a separate control channel. If the Go tunneler doesn't show the same issues, that will put some additional pressure on getting those features into the C SDK.

The Go tunneler is CLI only and is run using the ziti command.

ziti tunnel tproxy -i <path to identity json file>

Alternatively, you could try running the router in proxy mode to see if that resolves the issue. That would also help narrow down the problem space.
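
If I'm remembering the config shape correctly (worth double-checking against the router tunneling docs), proxy mode is just another mode on the same tunnel binding, with the services you want exposed listed explicitly. Roughly, with a placeholder service name and port:

listeners:
  - binding: tunnel
    options:
      mode: proxy
      services:
        - "frigate:8971"

In proxy mode the router should bind a local TCP port for each listed service, so the ZET is taken out of the data path on that side.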

If the problem still happens when the ZET is removed from the data path, then my hypothesis is incorrect and we'll need to dig further.

Let me know if that clarifies things.
Paul

Got it. This is super helpful, thank you.

I’ll give the Go tunneler a try early next week when I’m back at the office machine, where I’ll be able to properly test the connectivity.

Appreciate your time and help with this.

Hi @plorenz,

I put the Go tunneler on the Linux server and ran it with tproxy in place of ZET, then bound the service back to that identity, and it worked as expected. So things break when ZET is in play.

I also confirmed with more testing that it breaks even on the local network, for other traffic that is hitting the router on the Linux server (so it doesn’t need to hop to another router for it to break). The issue seems to be entirely within ZET.

Since there is a router on that machine anyway, I’ll continue binding the services to it, since that is working as desired.

Thank you for following up. Good to know that we've isolated the issue and have a workaround. Getting those features added to the ZET is on the backlog; it's important for us to know that this is hitting people in real-world usage.

Thank you,
Paul