Fabric „crashes“ with thousands of circuits and idle threshold warnings

We’ve just today discovered a problem that I believe is caused by wildcard services.
After creating the service as described by @scareything, we noticed (a lot!!) of „circuit idle threshold exceeded“ logs. A couple minutes after creation it was approximately 4000/minute but raising pretty quickly. As far as I know we didn’t do anything „special“ after creating the service like renaming or such.
A restart of the controller temporarily fixed the problem.
However this morning we ended up with a couple hundred thousand logs and significant impact on the whole fabric. We also observed that the circuitCount showed ~18000.

Any ideas on what this could be caused by? What additional info can I provide? Are there logs that associate a circuit to a service?
Running the newest Ziti v0.31.0

Yikes! It's a light week here for the OpenZiti team fyi. I'd start by collecting the logs from the controller and the routers if you can. You can email them to help at openziti.org or share them some other way if you feel comfortable with that. If not, does removing the service help? Can you put it back and recreate the problem?

It always helps developers tremendously if you can reliably reproduce the problem with associated steps, if you have them?

1 Like

Thanks. I've mailed the logs.
The problem is gone after deleting the service - but not the wildcard service as I suspected which makes it even more unclear to me. Gosh, I hate problems I can't reproduce.
Maybe you guys find something in the logs or I will find a way to reproduce in the future. :slight_smile:

One thing I've just thought of:
I've created two services, one wildcard "many to one" service and one non-wildcard service to allow for traffic in the other direction, I've just deleted the non-wildcard service.
Both services had the same role attribute. Is this a problem?

The logs seem to only contain the warning: "circuit exceeds idle threshold"

I see the logs are from:
Journal begins at Wed 2023-11-22 09:40:37 CET, ends at Wed 2023-11-22 16:05:57 CET.

Can you go back a day and find the few minutes before all those messages start flying?

Thanks, the journal was rotated, but I was able to reproduce this:

./ziti edge create config zabbix.agents.cfg.intercept intercept.v1 '{
    "addresses": ["*.ziti"],
    "protocols": ["tcp"],
    "portRanges": [ {"low":10050,"high":10050} ],
    "dialOptions": { "identity": "$dst_hostname" }
}'
./ziti edge create config zabbix.agents.cfg.host host.v1 '{
    "address": "127.0.0.1",
    "protocol": "tcp",
    "port": 10050,
    "listenOptions": { "identity": "$tunneler_id.name" }
}'
./ziti edge create service zabbix.agents \
    --configs zabbix.agents.cfg.intercept,zabbix.agents.cfg.host \
    --role-attributes zabbix_agent
./ziti edge create service-policy zabbix.agents.dial Dial --identity-roles "@zabbix.ziti" --service-roles "@zabbix.agents"
./ziti edge create service-policy zabbix.agents.bind Bind --identity-roles "#zabbix_agent" --service-roles "@zabbix.agents"

What exactly does circuit exceeds idle threshold mean? No usage of the circuit? What's the result of the warning? Is the circuit closed afterwards? How is the threshold set?

You can see pretty well how the amount of logs build up steadily:

I've emailed you the recent logs again from 16:46. The service was created around ~16:47.

Thanks!

If you like, I'm also open to execute the commands together or do a troubleshooting screen-sharing session if you're unable to reprocude yourself :slight_smile:

Hi, can I do anything to help troubleshoot or pinpoint the issue?

That kind of dial rate usually indicates you are capturing your own traffic. That is, the same device is hosting and dialing the service. The loop forms as the traffic is initiated, intercepted, emitted by the same device (hosting), recaptured, wash, rinse, repeat. The bestway to start looking is the policy-advisor. For the services in question, run the policy advisor and see if any device has both Bind and Dial permissions. That is usually an indicator of the problem config. You can have valid configs of that, especially with addressable terminators, but it isn't very common.

If it isn't that, it would seem that you have something not closing the circuit properly. Looking at the actual traffic that is being intercepted, you might see a particular application that is not closing connections well. The idle circuits will accrete and are emitted per minute, so they will rise over time even if the rate of creation is linear, as they don't age out, and are relogged each minute they remain.

1 Like

Thanks, I'll check that!

That probably isn't the problem as only a Ziti Edge Tunnel is running on the particular machine. I'll check and give you an update.

Hi @dmuensterer ,
Idle circuits are circuits which are established but haven't had any traffic flowing on them. When we detect these, we check with the controller to make sure they are still valid. If the controller controller doesn't know about them, it will let the router know they can be cleaned up.

We don't currently close them, as there are some use cases where users have long lived connections that only send traffic very sporadically.

Orphan circuits happen sometimes when circuits are re-routed, as the old path isn't cleaned up immediately, rather we let the this GC process handle the cleanup.

The settings are configurable in the router config file.

forwarder:
    # values specified in milliseconds
    idleCircuitTimeout: 60000 # how long to wait before checking 
    idleTxInterval: 60000 # how often to scan for idle circuits

If this turns out not to be the issue that @mike.gorman highlighted as a possibility (of the tunneler intercepting its own traffic), let us know and we'll dig into it further. If you can make it happen reliably, we should be to track it down.

One tool we can use is circuit inspections.

ziti fabric inspect circuit --include-stacks <circuit id>

If you run that, that will let us know what the routers know about the circuit and will help us narrow down where the issue is. We should be able to tell if the routers have a valid circuit (which would indicate a problem with the tunneler or external application(s)) or if some close notifications are getting lost and the problem is in the fabric.

Cheers,
Paul

1 Like

Indeed! Thanks a lot for the help.

OKAY : zabbix.ziti (2) -> wildcard.ssh (2) Common Routers: (2/2) Dial: Y Bind: Y

And so easy to troubleshoot if you know where to look :slight_smile:
That might be something where an alert could be displayed in ZAC even though there might be edge cases where this is valid?

Is there a way to close them manually? Or just via a controller restart?

You can use

ziti fabric delete circuit <circuit id>

Unfortunately we don't have querying implemented for the purely in-memory types, like circuits, so you can't use a filter with the delete.

Paul

1 Like

Okay, even though there was a service with both Dial and Bind permissions, after deleting them the problem still persists after deleting them:

ziti@zt:~/.ziti/quickstart/zt/ziti-bin$ ./ziti edge policy-advisor services -q | grep "Dial: Y Bind: Y"
ziti@zt:~/.ziti/quickstart/zt/ziti-bin$

I have thousands of similar circuits, as expected from the wildcard service:

ziti@zt:~/.ziti/quickstart/zt/ziti-bin$ ./ziti fabric list circuits
| y28zHu5Cc  │ clpiggpnd000lj37o7cx016pb │ zabbix.agents                            │ 7JxFBukz9zmSLnfnZNYvr0 │ r/zt-edge-router                                                               │
│ y2CzXu5yc  │ clpiggpnd000lj37o7cx016pb │ zabbix.agents                            │ 7JxFBukz9zmSLnfnZNYvr0 │ r/zt-edge-router                                                               │
│ y5g3HubCc  │ clpiggpnd000lj37o7cx016pb │ zabbix.agents                            │ 6CVr2Pt27RUjyDDNtK4Edm │ r/zt-router-1.ziti                                                 │
│ yDXlHD5Cc  │ clpiggpnd000lj37o7cx016pb │ zabbix.agents                            │ 1i6n3KQZGcAvBOJDRILvSI │ r/zt-router-1.ziti                                                 │
│ yJ.yHDbyh  │ clpiggpnd000lj37o7cx016pb │ zabbix.agents                            │ 59KzKo2Uk8tnPMGygHmd0Y │ r/zt-router-1.ziti                                                 │
ziti@zt:~/.ziti/quickstart/zt/ziti-bin$ ./ziti edge policy-advisor services -q | grep zabbix.agents
OKAY : el01.ziti (2) -> zabbix.agents (2) Common Routers: (2/2) Dial: N Bind: Y
OKAY : bastion.ziti (2) -> zabbix.agents (2) Common Routers: (2/2) Dial: N Bind: Y
OKAY : soar.ziti (2) -> zabbix.agents (2) Common Routers: (2/2) Dial: N Bind: Y
OKAY : gl01.ziti (2) -> zabbix.agents (2) Common Routers: (2/2) Dial: N Bind: Y
OKAY : zt-router-1.ziti (2) -> zabbix.agents (2) Common Routers: (2/2) Dial: N Bind: Y
OKAY : support.ziti (2) -> zabbix.agents (2) Common Routers: (2/2) Dial: N Bind: Y
OKAY : zt.ziti (2) -> zabbix.agents (2) Common Routers: (2/2) Dial: N Bind: Y
OKAY : zabbix.ziti (2) -> zabbix.agents (2) Common Routers: (2/2) Dial: Y Bind: N

Here's a stack:

ziti@zt:~/.ziti/quickstart/zt/ziti-bin$ ./ziti fabric inspect circuit --include-stacks y28zHu5Cc
Results: (1)
YZanu3dvYy.circuitAndStacks:y28zHu5Cc
circuitId: y28zHu5Cc
forwards:
  VDGO: dyM6
  dyM6: VDGO
linkDetails: {}
xgressDetails:
  VDGO:
    address: VDGO
    flags: "10"
    goroutines:
    - "goroutine 4875782 [select]:\ngithub.com/openziti/ziti/router/xgress.(*LinkSendBuffer).run(0xc02c6c1780)\n\tgithub.com/openziti/ziti/router/xgress/link_send_buffer.go:217
      +0x3ae\ncreated by github.com/openziti/ziti/router/xgress.NewLinkSendBuffer\n\tgithub.com/openziti/ziti/router/xgress/link_send_buffer.go:116
      +0x1f8\n\n"
    - "goroutine 4875774 [select, 4 minutes]:\ngithub.com/openziti/ziti/router/xgress_edge.(*baseMsgQ).Pop(0xc003c0df50)\n\tgithub.com/openziti/ziti/router/xgress_edge/msg_queue.go:46
      +0x6b\ngithub.com/openziti/ziti/router/xgress_edge.(*edgeXgressConn).ReadPayload(0xc02c6c0e80)\n\tgithub.com/openziti/ziti/router/xgress_edge/fabric.go:193
      +0x178\ngithub.com/openziti/ziti/router/xgress.(*Xgress).rx(0xc002401dc0)\n\tgithub.com/openziti/ziti/router/xgress/xgress.go:519
      +0x3e4\ncreated by github.com/openziti/ziti/router/xgress.(*Xgress).payloadIngester\n\tgithub.com/openziti/ziti/router/xgress/xgress.go:382
      +0x230\n\n"
    - "goroutine 4875783 [select, 4 minutes]:\ngithub.com/openziti/ziti/router/xgress.(*Xgress).nextPayload(...)\n\tgithub.com/openziti/ziti/router/xgress/xgress.go:416\ngithub.com/openziti/ziti/router/xgress.(*Xgress).tx(0xc002401dc0)\n\tgithub.com/openziti/ziti/router/xgress/xgress.go:447
      +0x348\ncreated by github.com/openziti/ziti/router/xgress.(*Xgress).Start\n\tgithub.com/openziti/ziti/router/xgress/xgress.go:241
      +0x2bf\n\n"
    linkSendBufferPointer: "0xc02c6c1780"
    originator: Terminator
    recvBufferDetail:
      acquiredSafely: true
      lastSizeSent: 0
      maxSequence: 3
      nextPayload: none
      payloadCount: 0
      sequence: 3
      size: 0
    sendBufferDetail:
      accumulator: 62
      acquiredSafely: true
      blockedByLocalWindow: false
      blockedByRemoteWindow: false
      closeWhenEmpty: false
      closed: false
      duplicateAcks: 0
      linkRecvBufferSize: 38
      linkSendBufferSize: 0
      retransmits: 0
      retxScale: 1.5
      retxThreshold: 200
      successfulAcks: 3
      timeSinceLastRetx: 3m54.816s
      windowSize: 16384
    sequence: 3
    timeSinceLastLinkRx: 3m54.812s
    xgressPointer: "0xc002401dc0"
  dyM6:
    address: dyM6
    flags: "0"
    goroutines:
    - "goroutine 4875772 [select, 4 minutes]:\ngithub.com/openziti/ziti/router/xgress_edge.(*baseMsgQ).Pop(0xc0093865e8)\n\tgithub.com/openziti/ziti/router/xgress_edge/msg_queue.go:46
      +0x6b\ngithub.com/openziti/ziti/router/xgress_edge.(*edgeXgressConn).ReadPayload(0xc0060c9a80)\n\tgithub.com/openziti/ziti/router/xgress_edge/fabric.go:193
      +0x178\ngithub.com/openziti/ziti/router/xgress.(*Xgress).rx(0xc002520d20)\n\tgithub.com/openziti/ziti/router/xgress/xgress.go:519
      +0x3e4\ncreated by github.com/openziti/ziti/router/xgress.(*Xgress).Start\n\tgithub.com/openziti/ziti/router/xgress/xgress.go:239
      +0x27c\n\n"
    - "goroutine 4875773 [select, 4 minutes]:\ngithub.com/openziti/ziti/router/xgress.(*Xgress).nextPayload(...)\n\tgithub.com/openziti/ziti/router/xgress/xgress.go:416\ngithub.com/openziti/ziti/router/xgress.(*Xgress).tx(0xc002520d20)\n\tgithub.com/openziti/ziti/router/xgress/xgress.go:447
      +0x348\ncreated by github.com/openziti/ziti/router/xgress.(*Xgress).Start\n\tgithub.com/openziti/ziti/router/xgress/xgress.go:241
      +0x2bf\n\n"
    - "goroutine 4875771 [select]:\ngithub.com/openziti/ziti/router/xgress.(*LinkSendBuffer).run(0xc0060c9c80)\n\tgithub.com/openziti/ziti/router/xgress/link_send_buffer.go:217
      +0x3ae\ncreated by github.com/openziti/ziti/router/xgress.NewLinkSendBuffer\n\tgithub.com/openziti/ziti/router/xgress/link_send_buffer.go:116
      +0x1f8\n\n"
    linkSendBufferPointer: "0xc0060c9c80"
    originator: Initiator
    recvBufferDetail:
      acquiredSafely: true
      lastSizeSent: 38
      maxSequence: 2
      nextPayload: none
      payloadCount: 0
      sequence: 2
      size: 0
    sendBufferDetail:
      accumulator: 77
      acquiredSafely: true
      blockedByLocalWindow: false
      blockedByRemoteWindow: false
      closeWhenEmpty: false
      closed: false
      duplicateAcks: 0
      linkRecvBufferSize: 0
      linkSendBufferSize: 0
      retransmits: 0
      retxScale: 1.5
      retxThreshold: 200
      successfulAcks: 4
      timeSinceLastRetx: 472551h10m38.973s
      windowSize: 16384
    sequence: 4
    timeSinceLastLinkRx: 3m54.682s
    xgressPointer: "0xc002520d20"

Any ideas?
Maybe one more hint on the setup to recreate:
There's one wildcard service that allows SSH access from #admin to #ssh
There's another wildcard service that allows for TCP/10050 from @zabbix.ziti to #zabbix_agent
@zabbix.ziti also has the attribute #ssh.

Is this a problem? Looks like the two wildcard services somehow interfere with each other?

Hi, I've found a way to reproduce by using 2 wildcard services with the following 3 identities and attributes.
And a small explanation of what I'm trying to achieve:
#admin should be able to dial all *.ziti addresses to access port 22.
#ssh should let out the traffic on port 22 dialed by #admin
@Monitoring should be able to dial all *.ziti addresses to access Port 10050.
#monitoring_client should let out the traffic on port 10050 dialed by @Monitoring

Identitites:
@Client #admin
@Monitoring #ssh
@Server #ssh #monitoring_client

Code to create services/configs/policies:

ziti edge create config monitoring_clients.cfg.intercept intercept.v1 '{
    "addresses": ["*.ziti"],
    "protocols": ["tcp"],
    "portRanges": [ {"low":10050,"high":10050} ],
    "dialOptions": { "identity": "$dst_hostname" }
}'

ziti edge create config monitoring_clients.cfg.host host.v1 '{
    "address": "127.0.0.1",
    "protocol": "tcp",
    "port": 10050,
    "listenOptions": { "identity": "$tunneler_id.name" }
}'

ziti edge create service monitoring_clients \
    --configs monitoring_clients.cfg.intercept,monitoring_clients.cfg.host \
    --role-attributes zabbix_agent
ziti edge create service-policy monitoring_clients.dial Dial --identity-roles "@Monitoring" --service-roles "@monitoring_clients"
ziti edge create service-policy monitoring_clients.bind Bind --identity-roles "#monitoring_client" --service-roles "@monitoring_clients"
ziti edge create config ssh.cfg.intercept intercept.v1 '{
    "addresses": ["*.ziti"],
    "protocols": ["tcp"],
    "portRanges": [ {"low":22,"high":22} ],
    "dialOptions": { "identity": "$dst_hostname" }
}'

ziti edge create config ssh.cfg.host host.v1 '{
    "address": "127.0.0.1",
    "protocol": "tcp",
    "port": 22,
    "listenOptions": { "identity": "$tunneler_id.name" }
}'

ziti edge create service ssh \
    --configs ssh.cfg.intercept,ssh.cfg.host \
    --role-attributes zabbix_agent
ziti edge create service-policy ssh.dial Dial --identity-roles "#admin" --service-roles "@ssh"
ziti edge create service-policy ssh.bind Bind --identity-roles "#ssh" --service-roles "@ssh"

I don't see "#ssh" defined as a service role but used in a service role value. Is that a typo, or is that actually what is in your code?

Thanks! that was a typo!

In the code, or just in the post? I want to be sure we don't need to track down what happens if you enter an attribute that isn't defined.

Hi @dmuensterer
The circuit inspects show a complete, valid circuit in the fabric, so the problem is likely either with the tunnelers or external (or you've got a loop problem). Some applications don't close cleanly, and that can leave dangling circuits.

If you've got a loop issue, you'll see more circuits than you expect. If the number of circuits that's building up looks like it corresponds with the number of connections that have been made, then it's probably a problem with incomplete closes.

We have seen similar issues elsewhere and we're working on the following to mitigate this:

  1. Adding a configurable keepalive in the tunnelers (probably in the service config), so that if there's an incomplete close we detect it.
  2. Adding a configurable, per-service max-idle time, so that idle circuits can get cleaned up automatically.

Cheers,
Paul

1 Like

I did some more digging and it indeed seems like Zabbix is the source of the problem. Zabbix is a monitoring system and we're trying to establish a connection from the monitoring server to the agents.

What exactly do you mean by that? Open TCP connections? Just trying to understand also what an application can do to prevent the problem.

I'll try that, thanks!