Fabric „crashes“ with thousands of circuits and idle threshold warnings

There's also a new per-service 'maxIdleTime' setting, available since 0.31.3. See GitHub issue #1496 in openziti/ziti ("Configurable Timer needed to close idle circuits") for details. It looks like ZET may still need some tweaks, but if you want to try the new setting, it should let the controller keep idle circuits under control.

Cheers,
Paul

Thanks, will try out the new release!

Thank you for reporting your findings.

Can you do the following:

  • Let it run for a while to accumulate circuits.
  • Run ziti-edge-tunnel dump on both sides (intercepting and hosting). This will create a file with information about the current state, as ZET sees it.

I just noticed a minor error in the dump command that Eugene provided. You need to run it like this to get the dump output to a file:

ziti-edge-tunnel dump -p /directory/where/dump/output/will/be/created

Without -p, the dump contents will be emitted to the log and truncated at 1k.

-Shawn

I'm getting "failed to connect: -111/connection refused" on one host, even though I'm running the command as root. On the other host I get a success message, but no output is created:

dmuensterer@zabbix:~$ sudo ziti-edge-tunnel dump -p /tmp/ziti-edge-tunnel-dump-bind/
received response <{"Success":true,"Code":0}
>

On the host where the dump reports success, are there any files in the directory that you specified?

Does a ziti group exist on the host that reports connection refused?

Yes, group exists but the folder is not created. If I create the folder, it's empty:

dmuensterer@zabbix:~$ sudo ziti-edge-tunnel dump -p /tmp/ziti-edge-tunnel-dump-bind/
received response <{"Success":true,"Code":0}
>
dmuensterer@zabbix:~$ less /etc/group | grep ziti
ziti:x:995:
dmuensterer@zabbix:~$ ls -la /tmp/ziti-edge-tunnel-dump-bind
ls: cannot access '/tmp/ziti-edge-tunnel-dump-bind': No such file or directory
dmuensterer@zabbix:~$ mkdir /tmp/ziti-edge-tunnel-dump-bind
dmuensterer@zabbix:~$ sudo ziti-edge-tunnel dump -p /tmp/ziti-edge-tunnel-dump-bind/
received response <{"Success":true,"Code":0}
>
dmuensterer@zabbix:~$ ls -la /tmp/ziti-edge-tunnel-dump-bind
total 8
drwxr-xr-x  2 dmuensterer dmuensterer 4096 Dec 20 12:43 .
drwxrwxrwt 13 root        root        4096 Dec 20 12:43 ..
dmuensterer@zabbix:~$ 

We definitely need to clean up the dump command. In the meantime, which user/group is the ziti-edge-tunnel process running as? Does that user have write permission to the dump directory (drwxr-xr-x 2 dmuensterer dmuensterer)?

Thank you, I've upgraded to v0.31.4 and set up the service with a max idle time of 60 seconds, but the circuits remain open:

ziti@zt:~/.ziti/quickstart/zt/ziti-bin$ ziti edge list services --verbose
... a lot of other services ....
{
         "_links": {
            "configs": {
               "href": "./services/10zAARGgspIcy0l9nZgGGf/configs"
            },
            "self": {
               "href": "./services/10zAARGgspIcy0l9nZgGGf"
            },
            "service-edge-router-policies": {
               "href": "./services/10zAARGgspIcy0l9nZgGGf/service-edge-router-policies"
            },
            "service-policies": {
               "href": "./services/10zAARGgspIcy0l9nZgGGf/service-policies"
            },
            "terminators": {
               "href": "./services/10zAARGgspIcy0l9nZgGGf/terminators"
            }
         },
         "createdAt": "2023-12-20T18:56:52.154Z",
         "id": "10zAARGgspIcy0l9nZgGGf",
         "tags": {},
         "updatedAt": "2023-12-20T18:56:52.154Z",
         "config": {},
         "configs": [
            "2sKPnrRnsT60sYhh0PAlgn",
            "7lWEc0UEm56NE1cxbceoV1"
         ],
         "encryptionRequired": true,
         "maxIdleTimeMillis": 60000,
         "name": "zabbix_agent.svc",
         "permissions": [
            "Bind",
            "Dial"
         ],
         "postureQueries": [],
         "roleAttributes": null,
         "terminatorStrategy": "smartrouting"
      }

The circuits still build up:

ziti@zt:~/.ziti/quickstart/zt/ziti-bin$ ziti fabric list circuits | grep zabbix_agent.svc | wc -l
1519
ziti@zt:~/.ziti/quickstart/zt/ziti-bin$ ziti fabric list circuits
....
│ xqjF4I.Pl  │ clqe4yaak0mevdk7oh1vjr0z5 │ zabbix_agent.svc                    │ 565JRp8it1jFJDNIgyaxGB │ 2023-12-20 19:22:01 │ r/zt-edge-router                                                               │
│ xsyQiP.I9  │ clqe4yaak0mevdk7oh1vjr0z5 │ zabbix_agent.svc                    │ 6Tck1fFyAYxyq7K4N9z1oC │ 2023-12-20 21:08:36 │ r/zt-router-1.company.ziti                                                 │
│ xuIzrPdI9  │ clqe4yaak0mevdk7oh1vjr0z5 │ zabbix_agent.svc                    │ 565JRp8it1jFJDNIgyaxGB │ 2023-12-20 20:05:40 │ r/zt-edge-router                                                               │
│ y3Vb4IdI9  │ clqe4yaak0mevdk7oh1vjr0z5 │ zabbix_agent.svc                    │ 6Tck1fFyAYxyq7K4N9z1oC │ 2023-12-20 19:29:46 │ r/zt-router-1.company.ziti                                                 │
│ y5vfIIdP9  │ clqe4yaak0mevdk7oh1vjr0z5 │ zabbix_agent.svc                    │ 565JRp8it1jFJDNIgyaxGB │ 2023-12-20 20:31:03 │ r/zt-router-1.company.ziti -> l/7QFdTGEppWIoLwebu2o9lV -> r/zt-edge-router │
│ y60N.I.Pl  │ clqe4yaak0mevdk7oh1vjr0z5 │ zabbix_agent.svc                    │ 565JRp8it1jFJDNIgyaxGB │ 2023-12-20 21:32:40 │ r/zt-router-1.company.ziti -> l/7QFdTGEppWIoLwebu2o9lV -> r/zt-edge-router │
│ yA9BQP.Pl  │ clqe4yaak0mevdk7oh1vjr0z5 │ zabbix_agent.svc                    │ 565JRp8it1jFJDNIgyaxGB │ 2023-12-20 20:23:35 │ r/zt-edge-router                                                               │
│ yE9WmPdIl  │ clqe4yaak0mevdk7oh1vjr0z5 │ zabbix_agent.svc                    │ 6Tck1fFyAYxyq7K4N9z1oC │ 2023-12-20 19:11:29 │ r/zt-router-1.company.ziti     
ziti@zt:~/.ziti/quickstart/zt/ziti-bin$ ziti fabric inspect circuit yE9WmPdIl
Results: (1)
UegPIgBsY.circuit:yE9WmPdIl
circuitId: yE9WmPdIl
forwards:
  5pw8: epAl
  epAl: 5pw8
linkDetails: {}
xgressDetails:
  5pw8:
    address: 5pw8
    flags: "0"
    goroutines: null
    linkSendBufferPointer: "0xc0035b0480"
    originator: Initiator
    recvBufferDetail:
      acquiredSafely: true
      lastSizeSent: 120
      maxSequence: 2
      nextPayload: none
      payloadCount: 0
      sequence: 2
      size: 0
    sendBufferDetail:
      accumulator: 287
      acquiredSafely: true
      blockedByLocalWindow: false
      blockedByRemoteWindow: false
      closeWhenEmpty: false
      closed: false
      duplicateAcks: 0
      linkRecvBufferSize: 0
      linkSendBufferSize: 0
      retransmits: 0
      retxScale: 1.5
      retxThreshold: 200
      successfulAcks: 4
      timeSinceLastRetx: 3h8m1.215s
      windowSize: 16384
    sequence: 4
    timeSinceLastLinkRx: 3h8m1.212s
    xgressPointer: "0xc0153441c0"
  epAl:
    address: epAl
    flags: "10"
    goroutines: null
    linkSendBufferPointer: "0xc002b6c500"
    originator: Terminator
    recvBufferDetail:
      acquiredSafely: true
      lastSizeSent: 0
      maxSequence: 3
      nextPayload: none
      payloadCount: 0
      sequence: 3
      size: 0
    sendBufferDetail:
      accumulator: 144
      acquiredSafely: true
      blockedByLocalWindow: false
      blockedByRemoteWindow: false
      closeWhenEmpty: false
      closed: false
      duplicateAcks: 0
      linkRecvBufferSize: 120
      linkSendBufferSize: 0
      retransmits: 0
      retxScale: 1.5
      retxThreshold: 200
      successfulAcks: 3
      timeSinceLastRetx: 3h8m1.212s
      windowSize: 16384
    sequence: 3
    timeSinceLastLinkRx: 3h8m1.209s
    xgressPointer: "0xc000bdc8c0"
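
The inspect output above reports timeSinceLastLinkRx as a Go-style duration (3h8m1.209s), while the service is configured with maxIdleTimeMillis: 60000. A minimal shell sketch for comparing the two follows. The helper names duration_to_ms and exceeds_max_idle are hypothetical (they are not part of the ziti CLI), and only the h, m, s and ms units are handled.

```shell
# Convert a Go-style duration (e.g. "3h8m1.212s") to milliseconds.
# Sketch only: handles the h, m, s and ms units, nothing smaller.
duration_to_ms() {
  printf '%s\n' "$1" | awk '{
    rest = $0; total = 0
    while (match(rest, /^[0-9.]+(h|ms|m|s)/)) {
      token = substr(rest, 1, RLENGTH)
      value = token; sub(/[a-z]+$/, "", value)   # numeric part
      unit = substr(token, length(value) + 1)    # unit suffix
      if (unit == "h")       total += value * 3600000
      else if (unit == "m")  total += value * 60000
      else if (unit == "s")  total += value * 1000
      else                   total += value      # "ms"
      rest = substr(rest, RLENGTH + 1)
    }
    printf "%.0f\n", total
  }'
}

# Succeeds when the duration in $1 is longer than the limit in $2 (milliseconds).
exceeds_max_idle() {
  [ "$(duration_to_ms "$1")" -gt "$2" ]
}

# Values taken from the output above:
if exceeds_max_idle "3h8m1.209s" 60000; then
  echo "idle longer than maxIdleTimeMillis"
fi
```

With the values from this post the check succeeds, which matches the observation that the circuit has been idle far longer than the configured 60 seconds.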

Aha, I'm now able to dump on the bind host. I thought running ziti-edge-tunnel dump as root would be sufficient, but only after chowning the directory to ziti:ziti did it create a file successfully.
@ekoby I've sent you the dump via PM!

For the dial side, the same command and permissions still leave me with

dmuensterer@bastion:~$ sudo ziti-edge-tunnel dump -p /tmp/ziti-edge-tunnel-dump-bind/
failed to connect: -111/connection refused
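
The fix that worked on the bind host (creating the dump directory first and giving it to ziti:ziti) can be sketched as a small helper. The function name make_dump_dir is hypothetical, and the ziti:ziti default is an assumption based on this thread; the tunneler may run as a different user or group on other installations.

```shell
# Create a dump directory that the tunneler's service account can write to.
# "ziti:ziti" is the owner that worked in this thread; adjust it if your
# ziti-edge-tunnel process runs as a different user or group.
make_dump_dir() {
  dir="$1"
  owner="${2:-ziti:ziti}"
  mkdir -p "$dir" && chown "$owner" "$dir"
}

# Usage (as root), using the path from the posts above:
#   make_dump_dir /tmp/ziti-edge-tunnel-dump-bind
#   ziti-edge-tunnel dump -p /tmp/ziti-edge-tunnel-dump-bind/
```

This only addresses the case where the dump reports success but writes no file; it does not help with the "connection refused" error.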

Interesting that the dump shows nothing that would indicate stale/leaked connections:

Connections:
conn[2078]: server service[zabbix_agent.svc] terminators[2]
conn[683]: server service[DeltaSecure_Allow_Zabbix_Agent_10051] terminators[2]
conn[1]: server service[wildcard.80.web] terminators[2]
	child[4253]: state[CloseWrite] caller_id[dm_mb] ch[2] zt-router-1.deltasecure.ziti
	child[4252]: state[CloseWrite] caller_id[dm_mb] ch[2] zt-router-1.deltasecure.ziti
conn[0]: server service[wildcard.ssh] terminators[2]
	child[4242]: state[Connected] caller_id[dm_mb] ch[3] zt-edge-router

Just based on the connection IDs, and assuming your ssh session is 4242, there are no connections that are older than it.

Correct, 4242 is for the ssh sessions.
Any further data I can provide to troubleshoot?
Were you able to tell from the pcap what in the traffic is causing this behaviour?

Thanks!

Hello, are you seeing messages in the router(s) that look like:

circuit exceeds idle threshold

That'll let us know if ziti considers them idle circuits, or if there is still traffic flowing over them.

Thank you,
Paul

Yes, I’m seeing lots of exactly those warnings.

Are there any files in /tmp/.ziti?

$ sudo ls -al /tmp/.ziti/
total 0
drwxr-x---.  2 root ziti  80 Dec 21 08:11 .
drwxrwxrwt. 24 root root 640 Dec 21 08:11 ..
srwxrwxrwx.  1 root ziti   0 Dec 21 08:11 ziti-edge-tunnel-event.sock
srwxrwxrwx.  1 root ziti   0 Dec 21 08:11 ziti-edge-tunnel.sock

These would be the domain sockets that are created by the ziti-edge-tunnel server process. It will only create this directory and the domain sockets within if they can be created with the ziti group. It insists on using the ziti group to avoid requiring things like electron UIs to run as root strictly for access to the domain sockets.

I'm guessing you have a ziti group on the intercepting system, since you say you're using the same permissions (and I assume ownership) of the dump directory as on the hosting tunneler's host. Did the ziti group on this system exist when the ziti-edge-tunnel server process was started? If not, you'll need to restart the process to have the domain sockets created.
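
The two conditions described above (the ziti group exists, and the command socket was created) can be checked with a short sketch like the following. The function names are hypothetical, the socket path is taken from the listing above, and getent is assumed to be available, as it is on most Linux hosts.

```shell
# Succeeds when the named group exists (defaults to "ziti").
group_exists() {
  getent group "${1:-ziti}" >/dev/null 2>&1
}

# Report whether the tunneler's command socket exists in the given directory.
# /tmp/.ziti and the socket name are taken from the listing above.
zet_socket_status() {
  sock="${1:-/tmp/.ziti}/ziti-edge-tunnel.sock"
  if [ -S "$sock" ]; then
    echo "socket present: $sock"
  else
    echo "socket missing: $sock"
    return 1
  fi
}

# Usage (as root, since /tmp/.ziti is readable only by root and the ziti group):
#   group_exists ziti || echo "no ziti group"
#   zet_socket_status || echo "restart ziti-edge-tunnel once the group exists"
```

A missing socket would be one plausible explanation for the "connection refused" error from the dump command, though other causes are possible.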

The packet capture looked normal, with no leaked/stale connections, which is consistent with the output from ziti-edge-tunnel dump. These findings narrow the cause of the problem to communication between ZET and the ER.
In the normal flow, ZET sends a ConnectionClosed message to the ER, and the ER tears down the circuit. So either ZET is failing to send the message or the ER is failing to process it.

Have you updated both sides to the latest ZET release?

That’s interesting.
Yes, both sides are on the newest ZET. Would a dump on the router help?

Are you seeing any messages that look like

removing idle circuit, idle time of X exceedes max idle time of Y

If not, did you upgrade the routers, or just the controller? If you upgraded both, would you be willing to try running the controller with verbose output? There are several debug messages that would tell us more about why idle circuits aren't being terminated.

Thank you,
Paul
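
The two log messages quoted in this thread can be tallied from saved log output with a sketch like the one below. The function name count_idle_messages and the log file name are hypothetical; where the logs actually live depends on how the controller and routers are run.

```shell
# Count the two idle-circuit log messages quoted in this thread.
# Reads log text on stdin; where the logs come from is up to your setup.
count_idle_messages() {
  awk '
    /circuit exceeds idle threshold/ { warned++ }
    /removing idle circuit/          { removed++ }
    END { printf "exceeds_threshold=%d removed=%d\n", warned, removed }
  '
}

# Usage, assuming the logs were saved to a file named router.log (hypothetical):
#   count_idle_messages < router.log
```

Many "exceeds idle threshold" warnings with no "removing idle circuit" lines would be consistent with the situation described here, where circuits are detected as idle but not torn down.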

Ah, sorry about that - I missed the routers and only upgraded the controller!
Works now, the circuits seem to be getting closed - no more idle circuits building up... Thanks for the help.

As for the ZET/ZR connection, if I can help narrow down the problem further, please let me know.
Every one of you at NetFoundry does such awesome work, and I'd love to help!
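
To confirm that circuits stop accumulating after an upgrade like this, the "ziti fabric list circuits | grep ... | wc -l" check used earlier in the thread can be generalized to a per-service count. This is a sketch: the function name is hypothetical, and it assumes the table layout shown earlier, with the service name in the third cell of each row and a header cell reading SERVICE.

```shell
# Count circuits per service from `ziti fabric list circuits` table output on stdin.
# Assumes the column layout shown earlier in this thread: the service name is
# the third cell of each "│"-delimited row, and a header cell reads "SERVICE".
count_circuits_by_service() {
  awk -F'│' '
    NF >= 6 {
      svc = $4
      gsub(/^[ \t]+|[ \t]+$/, "", svc)       # trim padding
      if (svc != "" && svc != "SERVICE") count[svc]++
    }
    END { for (s in count) printf "%s %d\n", s, count[s] }
  ' | sort
}

# Usage:
#   ziti fabric list circuits | count_circuits_by_service
```

Running it a few minutes apart and comparing the numbers shows whether any service is still building up circuits.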

That's great! I'm assuming ekoby will continue to dig into the ZET/router disconnect and will reach out if necessary, but likely not until the new year. Appreciate your persistence and assistance :)

Cheers!
Paul
