Frequent "Failed to dial fabric" for second router

Hi there,

I am self-hosting ziti and zrok on the same machine. On that machine I have a ziti controller, an edge router, and a newly added second edge router running in Docker.

Originally, everything ran fine with no downtime, but starting about 2 months ago, the main router would begin to report “failed to dial fabric”, and all of my network shares went down. I could not find much info on this, but I figured adding a second edge router to split the load would help.

This worked for a little while, but now I have to keep re-composing the router because about every 6 hours the same issues arise: “failed to dial fabric”, “xyz has no terminators”, etc. I’m not sure what information is helpful to add here, but I will say I haven’t done much configuration on the second router except give it its own FQDN and a port one higher than the original router’s (3023 vs. 3022).

The router still shows as online even while in this failing state: Docker says it’s healthy, and running “ziti edge list edge-routers” reports both routers as online.

I’d be happy to provide any debugging output or logs; just tell me how to find them, as I’m fairly new to ziti.

Edit for more info:

Adding the second router did initially add some redundancy: when one router collapsed, shares moved to the other.

Hi @SynthwaveFox

If you're seeing 'no terminators', that usually indicates a hosting problem, where either the hosting SDK is down or is having trouble maintaining an authentication session with the controller.

Can you share:

  1. The versions of the controller and routers you are running
  2. How you're hosting - via tunneler/embedded sdk (if so which language) or edge-router/tunneler
  3. Versions of the hosting software

We have been working through some issues with the move to OIDC/JWT-based sessions, so if you're using a tunneler to host, it's possible it's a known issue that's already been fixed.

Thank you,
Paul

@plorenz I will try my best to answer this given my lack of experience with the software.

The router running as a systemd service is v1.6.9, and the one in docker is v1.6.12. The controller also claims to be v1.6.9.

Edge-router, self-hosted: one running in Docker using the official openziti/ziti-router image (Docker Compose), and the other as a systemd service.
Not using an embedded SDK or tunneler. The router running under systemd seems to run more stably than the Docker one. Is the version mismatch a potential issue?

The routers are enrolled with JWTs as well.

Update: at 18:40 EST, all my network shares just went down; it’s unclear why, but this is common for me. It’s like a router is being overloaded or something once one of them goes down. There’s a lot of info in the logs, but I’m seeing “circuit unrouted” and “error proxying: unable to dial service 'panel' (dial failed: service…”, plus “no destination for circuit” and “has no terminators”.

Are all the dial failed messages 'no terminators'? If so, we need to figure out what's happening to the terminators.

One thing to try is adding event logging. See here for reference: Events | NetFoundry Documentation

If you add:

events:
  jsonLogger:
    subscriptions:
      - type: circuit
      - type: terminator
    handler:
      type: file
      format: json
      path: <path to events log>

The terminator events will show when terminators are created/updated/deleted and when their routers go offline or come back online. That should show us why there aren't terminators for the service: either the router is going offline, or the terminator is getting deleted. In either case we can check the router logs to see if the network connection is just flaky or if there's something else going on.
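Once the events file is being written, one quick way to pull out just the terminator events for a first look is a simple grep. The sketch below builds a tiny sample file so the filter can be demonstrated end to end; note that the field names used are illustrative guesses at the event shape, not the exact schema, so check a real line from your log and adjust the pattern.

```shell
# Build a tiny sample events log to demonstrate the filter.
# NOTE: these field names are hypothetical; inspect your real events
# file for the actual keys your ziti version emits.
cat > sample-events.json <<'EOF'
{"namespace":"terminator","event_type":"created","terminator_id":"t1","router_id":"r1"}
{"namespace":"circuit","event_type":"created","circuit_id":"c1"}
{"namespace":"terminator","event_type":"deleted","terminator_id":"t1","router_id":"r1"}
EOF

# Keep only the terminator lifecycle events
grep '"namespace":"terminator"' sample-events.json
```

Pointing the same grep at the real events file should narrow things down to just the terminator create/delete churn.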

Cheers,
Paul

Hi again. I added the following blocks to my routers’ config.yml.

events:
  jsonLogger:
    subscriptions:
      - type: circuit
      - type: terminator
    handler:
      type: file
      format: json
      path: /tmp/router1.json

It ran long enough to produce “circuit unrouted” errors, but no log file was created. I tried both .log and .json extensions, but for some reason I couldn’t get a log out of it.

For router1, I added the above block and restarted the systemd service. For router2, I manually added a similar block to the same config.yml in its docker volume and recomposed.

I do, however, have an entire systemd log (and Docker log) of the service, but it is quite lengthy, as it ran about 2 hours before the “circuit unrouted” and other errors appeared. I do not know at this moment why the event log did not get created: either I did not write the config correctly, I did not reapply the changes correctly, or something else.

Hi, I should have been clearer. The events: block should go in the controller config file. Router events get pushed up to the controller so there's a unified location for events.

Let me know how it goes,
Paul

None of my services have failed yet, and there are log lines regarding circuits and terminators, but still no log file has been created at the specified path. I did indeed add:

events:
  jsonLogger:
    subscriptions:
      - type: circuit
      - type: terminator
    handler:
      type: file
      format: json
      path: /tmp/events.json

into my /var/lib/private/ziti-controller/config.yml and /var/lib/ziti-controller/config.yml (unsure which is the active one, or whether they are the same file).

There are a few “circuit unrouted” and “terminator removed” messages inside the Docker container’s status output for the router.

I could provide an entire log of the controller/routers since start, but that would be an incredibly long file and include some IPs.

Just to be sure, did you restart your controller after making the config updates? It unfortunately can't pick up config changes at runtime.

One other note: the events.json file in your /tmp directory should be created on controller start-up, so if it's not there after you restart, something has already gone wrong.

Paul

Yes, I restarted the controller and routers, and found out that for some reason the controller did not like the /tmp directory (infamous Linux permissions, probably, or systemd’s private /tmp sandboxing, given the /var/lib/private path), so I put the log in /var/lib/private/ziti-controller/logs/ and it wrote immediately. I will follow up when this log generates anything noteworthy, but it could be a few days until the shares fail.

Nice, glad we've got some data coming. I'll keep an eye out for updates.

@plorenz

One router or service finally collapsed for a moment, and there are some “no terminators” lines in this log. I’m not sure if this is enough to conclude anything, but a web service of mine stopped working temporarily (possibly switching routers) at 2026-02-15 17:00 EST ±5 minutes, according to an uptime probe (22:00 in the server log).
Router 2 is indicating some new terminator creation in its log: “queuing terminator to send create”, “sending create terminator v2 request” at 22:19.

Update: now all services are down; posting a new log, which contains the old log too. Both routers failed.

events.log (5.3 MB)

Hi @SynthwaveFox

I looked at the events log and it shows that terminators for services with failing circuits were deleted but then recreated, generally within about 15 seconds.

Can you extract a portion of the router logs related to terminator deletes and creates?

Here's the extracted data (done with AI, so accuracy not guaranteed)

  ┌────────────────────────┬──────────────┬────────────────────────┬──────────────┬──────────────┬────────────────────────┬──────────────┬─────────────────┐                             
  │        Service         │ Deleted From │  Terminator (deleted)  │ Delete Time  │ Recreated On │  Terminator (created)  │ Create Time  │ Router Changed? │
  ├────────────────────────┼──────────────┼────────────────────────┼──────────────┼──────────────┼────────────────────────┼──────────────┼─────────────────┤
  │ 2DZDY162IBj4QLuJeP4xz4 │ 49rCu.LqhF   │ 72o0Pfu1K41ZZEhWQQJuq3 │ 18:53:20.755 │ 49rCu.LqhF   │ 1AzVEb7E0hMmQlFv3xeCZp │ 18:53:28.100 │ No              │
  ├────────────────────────┼──────────────┼────────────────────────┼──────────────┼──────────────┼────────────────────────┼──────────────┼─────────────────┤
  │ 2HkIwlvmaLUla3bwaPhWYH │ 49rCu.LqhF   │ 6kRkQTnZlCN7Lu1ro5BoEc │ 18:53:20.755 │ 49rCu.LqhF   │ 1DqhbkUDvChkdwmQfctSou │ 18:53:31.640 │ No              │
  ├────────────────────────┼──────────────┼────────────────────────┼──────────────┼──────────────┼────────────────────────┼──────────────┼─────────────────┤
  │ 2NpCpiEXC3Vum8g4zJJhwr │ 49rCu.LqhF   │ 2GmFpPxYZNbDRUCpC37ZRC │ 18:53:20.754 │ 49rCu.LqhF   │ 4EsdUzmt0tJbwaG9ySvx6e │ 18:53:27.244 │ No              │
  ├────────────────────────┼──────────────┼────────────────────────┼──────────────┼──────────────┼────────────────────────┼──────────────┼─────────────────┤
  │ 47WpwqnChgdhXpj556zroc │ 49rCu.LqhF   │ 5eECuf7obaEuOcp7wKwmwZ │ 18:53:20.755 │ 49rCu.LqhF   │ 2LGRZ2BW7qoLlHnQQMPvfy │ 18:53:26.500 │ No              │
  ├────────────────────────┼──────────────┼────────────────────────┼──────────────┼──────────────┼────────────────────────┼──────────────┼─────────────────┤
  │ 4WCSmnhcWWi36Tzo8I55ws │ 49rCu.LqhF   │ 2OwPWfaiHGKmf9VNz5ha2e │ 18:53:20.754 │ 49rCu.LqhF   │ 215fFaA1QWF1Z3R5erW7qN │ 18:53:25.357 │ No              │
  ├────────────────────────┼──────────────┼────────────────────────┼──────────────┼──────────────┼────────────────────────┼──────────────┼─────────────────┤
  │ u3uwNuLTqXMAts8FXQd70  │ 49rCu.LqhF   │ 7YwwetipmaGFsZDk6A654R │ 18:53:20.755 │ 49rCu.LqhF   │ 4OdGV5nAONtmCP3puxM54  │ 18:53:26.441 │ No              │
  ├────────────────────────┼──────────────┼────────────────────────┼──────────────┼──────────────┼────────────────────────┼──────────────┼─────────────────┤
  │ 6RrbELBQD76ch9VCUcbEi4 │ 49rCu.LqhF   │ 1CoVcm65SlhedKopS7Tggg │ 18:53:20.754 │ -8K2Nx7TvC   │ 5j52c9XPxCXLF98LNSTEXO │ 18:53:30.357 │ Yes             │
  ├────────────────────────┼──────────────┼────────────────────────┼──────────────┼──────────────┼────────────────────────┼──────────────┼─────────────────┤
  │ 6sVz5CA4LSrNTWsggeWTMJ │ 49rCu.LqhF   │ 4Y3NAsmirVLDvi2gH7j79A │ 18:53:20.754 │ -8K2Nx7TvC   │ 36l9sZmL47ybMfBkgcngjd │ 18:53:27.269 │ Yes             │
  ├────────────────────────┼──────────────┼────────────────────────┼──────────────┼──────────────┼────────────────────────┼──────────────┼─────────────────┤
  │ 7ZXoAF9wzqKJpwymnd1R8p │ 49rCu.LqhF   │ 1ZlGPlktpJyp9QB3A3qQ7X │ 18:53:20.754 │ -8K2Nx7TvC   │ 7In98X9jR58rFWJ7SKj5EL │ 18:53:27.378 │ Yes             │
  ├────────────────────────┼──────────────┼────────────────────────┼──────────────┼──────────────┼────────────────────────┼──────────────┼─────────────────┤
  │ BSNN23X2bcUbyUypxxOT3  │ 49rCu.LqhF   │ 5uw0880t5c8GVHKbIRx0a2 │ 18:53:20.755 │ -8K2Nx7TvC   │ 1egqetn4WnLzXwxOeWvVwl │ 18:53:31.095 │ Yes             │
  ├────────────────────────┼──────────────┼────────────────────────┼──────────────┼──────────────┼────────────────────────┼──────────────┼─────────────────┤
  │ Fn0rp0o6XHu52TA8U0hxV  │ 49rCu.LqhF   │ 61gTEjl5qPjegrsb0tWbzF │ 18:53:20.754 │ -8K2Nx7TvC   │ 6FMlCq8axlcSC5x6c7Ne0P │ 18:53:29.901 │ Yes             │
  ├────────────────────────┼──────────────┼────────────────────────┼──────────────┼──────────────┼────────────────────────┼──────────────┼─────────────────┤
  │ rRCnMeQ4xiRRi2YYkbWrG  │ 49rCu.LqhF   │ 7QEDn68F62MFqNUoGye7Mu │ 18:53:20.754 │ -8K2Nx7TvC   │ 2ApQF0HTPhm4qZrOwCqHtT │ 18:53:28.281 │ Yes             │
  └────────────────────────┴──────────────┴────────────────────────┴──────────────┴──────────────┴────────────────────────┴──────────────┴─────────────────┘

  All deletes happened within the same millisecond (~18:53:20.754-755). Recreate times range from 5-11 seconds later.

  Daily rotations — 2O39NGFrX7lKPRsu75VZNw (host ZL-pmuFP7c)

  ┌────────────────────────┬─────────────────────┬──────────────────┬────────────────────────┬─────────────────────┬──────────────────┬──────┐
  │  Terminator (deleted)  │     Delete Time     │ Router (deleted) │  Terminator (created)  │     Create Time     │ Router (created) │ Gap  │
  ├────────────────────────┼─────────────────────┼──────────────────┼────────────────────────┼─────────────────────┼──────────────────┼──────┤
  │ 7RxO5axCWwO9vvWr0hokKp │ Feb 13 05:00:02.021 │ 49rCu.LqhF       │ 5i8Ha9doSqE71Y7AKFEaYr │ Feb 13 05:00:04.937 │ -8K2Nx7TvC       │ 2.9s │
  ├────────────────────────┼─────────────────────┼──────────────────┼────────────────────────┼─────────────────────┼──────────────────┼──────┤
  │ 5i8Ha9doSqE71Y7AKFEaYr │ Feb 14 05:00:01.119 │ -8K2Nx7TvC       │ 3nDjKwSCyFRpSOSiriTOkC │ Feb 14 05:00:03.692 │ 49rCu.LqhF       │ 2.6s │
  ├────────────────────────┼─────────────────────┼──────────────────┼────────────────────────┼─────────────────────┼──────────────────┼──────┤
  │ 3nDjKwSCyFRpSOSiriTOkC │ Feb 15 05:00:01.309 │ 49rCu.LqhF       │ 3KaPCwgtxioInND6sxYyW8 │ Feb 15 05:00:04.240 │ -8K2Nx7TvC       │ 2.9s │
  └────────────────────────┴─────────────────────┴──────────────────┴────────────────────────┴─────────────────────┴──────────────────┴──────┘

  Daily rotations — Fn0rp0o6XHu52TA8U0hxV (host lvI2GtZDU)

  ┌────────────────────────┬─────────────────────┬──────────────────┬────────────────────────┬─────────────────────┬──────────────────┬──────┐
  │  Terminator (deleted)  │     Delete Time     │ Router (deleted) │  Terminator (created)  │     Create Time     │ Router (created) │ Gap  │
  ├────────────────────────┼─────────────────────┼──────────────────┼────────────────────────┼─────────────────────┼──────────────────┼──────┤
  │ 6FMlCq8axlcSC5x6c7Ne0P │ Feb 13 10:00:01.748 │ -8K2Nx7TvC       │ 6NZwJbQUzkUUAEv5yktuNa │ Feb 13 10:00:02.676 │ -8K2Nx7TvC       │ 0.9s │
  ├────────────────────────┼─────────────────────┼──────────────────┼────────────────────────┼─────────────────────┼──────────────────┼──────┤
  │ 6NZwJbQUzkUUAEv5yktuNa │ Feb 14 10:00:01.186 │ -8K2Nx7TvC       │ xEZkwxAusxPFQ6LEdMIrL  │ Feb 14 10:00:02.094 │ 49rCu.LqhF       │ 0.9s │
  ├────────────────────────┼─────────────────────┼──────────────────┼────────────────────────┼─────────────────────┼──────────────────┼──────┤
  │ xEZkwxAusxPFQ6LEdMIrL  │ Feb 15 10:00:01.627 │ 49rCu.LqhF       │ 5IvxHVIDHuKYyV0jNDGw7Y │ Feb 15 10:00:02.585 │ -8K2Nx7TvC       │ 1.0s │
  └────────────────────────┴─────────────────────┴──────────────────┴────────────────────────┴─────────────────────┴──────────────────┴──────┘

It doesn't show much downtime. I think the question is: why are terminators being deleted and then recreated? How are you hosting the service, and what is causing it to move from router to router?

From what you had said earlier, it sounded like you were hosting with the ER/T, but this feels more like tunneler- or SDK-based hosting.

Paul

Maybe I don’t know what I’m talking about. To give exact detail, I set everything up according to the zrok self-hosting guide: Self-Hosting Guide for Linux | NetFoundry Documentation.

As for the repeated timing: I believe I have a few of my zrok shares set to restart at 05:00 via systemd (and it could be 10:00 as well, because of a time-zone misconfiguration between machines). Sometimes with TCP services the share goes “stale”, with high ping and low throughput; restarting the zrok share service has been all that’s needed to fix it.

Feb 18 05:00:01 MNZ-Zrok-2.5G ziti[1347]: {"file":"github.com/openziti/ziti/router/xgress_edge/hosted.go:472","func":"github.com/openziti/ziti/router/xgress_edge.(*hostedServiceRegistry).Remove","level":"info","msg":"terminator removed from router set","reason":"controller delete success","terminatorId":"3aeHdaT1glM6K7faT5na2k","time":"2026-02-18T05:00:01.663Z"}
Feb 18 10:00:02 MNZ-Zrok-2.5G ziti[1347]: {"file":"github.com/openziti/ziti/router/xgress_edge/hosted.go:472","func":"github.com/openziti/ziti/router/xgress_edge.(*hostedServiceRegistry).Remove","level":"info","msg":"terminator removed from router set","reason":"controller delete success","terminatorId":"7Vqxo9osYpJ2aRPsuRbptc","time":"2026-02-18T10:00:02.036Z"}

I do indeed see in the router logs the terminator removal at those times.

The downtime is an interesting point, though. I’m sure the logged downtime is minimal, but sometimes the services don’t recover for a few hours. I would be troubleshooting something else and a router would randomly decide to start functioning again; that’s the weirdness I’ve been dealing with. The shares which automatically restart are not the ones I’m tracking downtime issues with. As mentioned before, if the routers start having issues, then every share goes down.

I have probably 20 total network shares running at once, some TCP, some web.

At 10:00:02 the router logged that the existing terminator was closed:

updated state newState:3 oldState:2 reason:"channel closed"
terminatorId: 7Vqxo9osYpJ2aRPsuRbptc
terminator removed from router set (controller delete success)


Immediately (~900ms later), a new terminator was created and established:

establishing terminator
terminatorId: aMcYDSOdu9amVhkRdvkCA

sending create terminator v2 request
terminator established


About 24 seconds later, a circuit attempt failed:

sending dial request to sdk
ERROR: cannot forward payload, no destination for circuit
closing while buffer contains unacked payloads
reported forwarding faults
circuit unrouted

Right now, I haven’t touched the routers or restarted anything and things are still working for the moment.

Hi @SynthwaveFox , I apologize, I forgot or didn't notice that you had mentioned zrok at the top, so that clarifies a lot about your deployment.

I feel like we're still a ways off from understanding what's happening here. It doesn't seem like a terminator issue, since the terminators seem relatively stable. There are some related dial failures.

Is the event log representative of a total router failure? I ran it through AI analysis (so again, not guaranteed to be correct), and it reported only 28 failures out of 4,228 dial attempts: half “no terminators”, half timeouts.
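If you want to reproduce that kind of tally without AI, something along these lines works against the events file. The sketch fabricates a few sample lines first so it runs standalone; the "event_type"/"failure_cause" keys and values are assumptions about the event shape, so verify them against an actual failed-circuit line before relying on the counts.

```shell
# Fabricate a few circuit events to demonstrate the tally.
# NOTE: the field names and values here are guesses at the schema,
# not the exact keys a real controller emits.
cat > sample-circuits.json <<'EOF'
{"namespace":"circuit","event_type":"failed","failure_cause":"NO_TERMINATORS"}
{"namespace":"circuit","event_type":"created","circuit_id":"c1"}
{"namespace":"circuit","event_type":"failed","failure_cause":"TIMEOUT"}
{"namespace":"circuit","event_type":"failed","failure_cause":"NO_TERMINATORS"}
EOF

# Count failed dials, grouped by cause, most frequent first
grep '"event_type":"failed"' sample-circuits.json \
  | grep -o '"failure_cause":"[^"]*"' \
  | sort | uniq -c | sort -rn
```

Run against the real events log, this gives a quick breakdown of which failure cause dominates during an outage window.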

ERROR: cannot forward payload, no destination for circuit
closing while buffer contains unacked payloads
reported forwarding faults

This is interesting. This can happen as part of normal operation when a circuit is being torn down. The end-of-circuit message might not get acked before the circuit is torn down, which is fine, since the circuit is being closed. However, if you see dramatically more of these when the network becomes non-responsive, that would be interesting.

So, based on the events log, it seems like dial failures are not the primary symptom, but rather circuits are failing after they are set up?

Do failures correlate with high-bandwidth traffic, maybe? There is a known issue where existing high-throughput circuits can cause problems when establishing other circuits. There is a fix for it, which requires some optional config to be turned on; see openziti/zrok v1.0.8 on GitHub.

Can you go into more detail about what the failure case looks like? What specific errors do you see from the client, what errors show up more frequently in the logs? I had assumed dial failures but since the event log doesn't show that, let's get more details and fill in the bigger picture.

Thank you,
Paul

It may not be a total failure that happened in the log. To be honest, I haven’t seen a complete router meltdown in a while. I do just think it’s weird that occasional outages occur. Is it normal behavior to see circuits removed and created?

As for high-bandwidth traffic, it’s possible. One of my network shares is a file server (which sees light, random usage).

I had UptimeRobot pinging my websites every 5 minutes to check for outages, and randomness appeared across the board. Outages would be about 3–4 days apart at worst, sometimes a week. When one occurred, the websites were replaced with the zrok “bad gateway” page. I could still access the main zrok dashboard as normal. Creating a new network share would fail on any new devices, environments, etc., but I can’t remember the error it gave. I think it was something about failing to create a terminator. Usually this would result in me just restarting the entire ziti machine, which would fix it for some time until it failed again.

It was at this point that I looked into adding a second edge router, which seems to have somewhat alleviated the issue. As it stands right now, I stopped the uptime monitors from pinging my services due to the annoying notifications, and all my sites are still available for the moment.