Ziti controller does not work

Ziti network is dead

%Cpu(s): 95.5 us,  2.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  1.1 si,  1.1 st 
MiB Mem :    970.0 total,     80.3 free,    650.8 used,    375.7 buff/cache     
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    319.2 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                      
    764 ziti      20   0 2120344 223416  49088 S 199.4  22.5  10:37.88 ziti    

What can be done?

No routers.

Possibly this is because I had zrok 1.0.6 controllers and 1.0.8 access instances running at the same time.

After migrating everything to zrok 1.0.8, the Ziti controller seems to work.

One more piece of information, the output of ziti fabric list links:

ziti fabric list links 
╭────────────────────────┬────────┬──────────┬─────────────┬─────────────┬─────────────┬───────────┬────────┬───────────╮
│ ID                     │ DIALER │ ACCEPTOR │ STATIC COST │ SRC LATENCY │ DST LATENCY │ STATE     │ STATUS │ FULL COST │
├────────────────────────┼────────┼──────────┼─────────────┼─────────────┼─────────────┼───────────┼────────┼───────────┤
│ 3kcrRbEWnYxltDSeQ6fmZ8 │ msk173 │ ovh221   │           1 │   69585.2ms │     113.7ms │ Connected │     up │     69699 │
│ 4aQuZ6KRzODkmhUqIhyi9o │ msk173 │ ovh76    │           1 │   69618.9ms │     121.3ms │ Connected │     up │     69740 │
╰────────────────────────┴────────┴──────────┴─────────────┴─────────────┴─────────────┴───────────┴────────┴───────────╯

A ping of 69585.2 ms?

How can that be possible?

This might be the reason why ziti takes 200% CPU and all available memory, and then stops working (killed by the system).

What is SRC Latency?

The ziti controller fills up the log with these messages:

1. AMQP queue full:

Jul 31 13:35:57 ziti[854]: {"error":"amqp queue full. Message: {\"namespace\":\"fabric.usage\",\"event_src_id\":\"dc\",\"timestamp\":\"2025-07-31T13:35:57.555669132Z\",\"version\":3,\"source_id\":\"LlqloYCyhW\",\"circuit_id\":\"JRMajXKza\",\"usage\":{\"fabric.rx\":138,\"fabric.tx\":51},\"interval_start_utc\":1753968900,\"interval_length\":60,\"tags\":null}","file":"github.com/openziti/ziti/controller/events/formatter.go:45","func":"github.com/openziti/ziti/controller/events.WriterEventSink.AcceptFormattedEvent","level":"error","msg":"failed to output event","time":"2025-07-31T13:35:57.557Z"}

2. Bad certificate:

Jul 31 13:13:29 ziti[854]: {"_context":"tls:0.0.0.0:993","error":"remote error: tls: bad certificate","file":"github.com/openziti/transport/v2@v2.0.167/tls/listener.go:260","func":"github.com/openziti/transport/v2/tls.(*sharedListener).processConn","level":"error","msg":"handshake failed","remote":"[::1]:55644","time":"2025-07-31T13:13:29.603Z"}

3. Error creating route:

Jul 31 13:43:51  ziti[854]: {"_channels":["establishPath"],"apiSessionId":"cmdrf5z2b17cdnqigrb50ozzx","attemptNumber":1,"circuitId":"jUieLQTza","file":"github.com/openziti/ziti/controller/network/routesender.go:197","func":"github.com/openziti/ziti/controller/network.(*routeSender).handleRouteSend","level":"warning","msg":"received failed route status from [r/LlqloYCyhW] for attempt [#0] of [s/jUieLQTza] (error creating route for [c/jUieLQTza]: timeout waiting for message reply: context deadline exceeded)","serviceId":"1qckVqdrhmPI7JZoxNjQb2","sessionId":"cmdrf5z5v17cfnqigzldv9juy","time":"2025-07-31T13:43:51.384Z"}
Jul 31 13:43:51 ziti[854]: {"_channels":["selectPath"],"apiSessionId":"cmdrf5z2b17cdnqigrb50ozzx","attemptNumber":1,"circuitId":"jUieLQTza","error":"error creating route for [s/jUieLQTza] on [r/LlqloYCyhW] (error creating route for [c/jUieLQTza]: timeout waiting for message reply: context deadline exceeded)","file":"github.com/openziti/ziti/controller/network/network.go:650","func":"github.com/openziti/ziti/controller/network.(*Network).CreateCircuit","level":"warning","msg":"route attempt for circuit failed","serviceId":"1qckVqdrhmPI7JZoxNjQb2","serviceName":"svc1","sessionId":"cmdrf5z5v17cfnqigzldv9juy","time":"2025-07-31T13:43:51.387Z"}

ziti version v1.5.4

Hi, I put together a quick guide on how to gather basic debug information: How To Gather OpenZiti Diagnostics · openziti/ziti Wiki · GitHub. If you're running out of memory, I'd look at gathering a memory pprof.
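A sketch of capturing one, assuming the gops-style agent subcommand (the exact name is worth verifying with ziti agent --help):

# capture a heap profile from the running controller process
ziti agent pprof-heap > controller-heap.pprof
# then inspect it with the Go toolchain
go tool pprof controller-heap.pprof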

Based on the AMQP queue full errors, you might want to check the amqp buffer size in the ziti config. It defaults to 50, but if you've got it set to something large, that could be chewing up memory.

Regarding high link latency: when a link is new, it starts with a high latency value until an actual value is reported. That ensures we don't bias towards new links without knowing their latency. If they're the only links available, we'll use them; otherwise we'll wait until we have an accurate latency reading.
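To illustrate the idea with a sketch (this is not the actual ziti code, and the placeholder value is an assumption): the full cost is roughly the static cost plus both latency readings, and an unmeasured latency is charged a deliberately large placeholder, which matches the ~69,000 ms figures in the table above.

package main

import "fmt"

// A sketch of the idea only, not the actual ziti implementation.
// A link whose latency has not been measured yet is charged a large
// placeholder so routing only prefers it once a real reading arrives.
const unmeasuredLatencyMs = 65000 // illustrative placeholder, not ziti's real value

type link struct {
	staticCost   float64
	srcLatencyMs float64 // negative until the first latency probe reports back
	dstLatencyMs float64
}

func fullCost(l link) float64 {
	src, dst := l.srcLatencyMs, l.dstLatencyMs
	if src < 0 {
		src = unmeasuredLatencyMs // bias against links with unknown latency
	}
	if dst < 0 {
		dst = unmeasuredLatencyMs
	}
	return l.staticCost + src + dst
}

func main() {
	fresh := link{staticCost: 1, srcLatencyMs: -1, dstLatencyMs: 113.7}
	settled := link{staticCost: 1, srcLatencyMs: 42.0, dstLatencyMs: 113.7}
	fmt.Println(fullCost(fresh), fullCost(settled)) // 65114.7 vs 156.7
}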

Regarding the route create error, check the router log to see if there's something going on there.

Paul

Could you please clarify how I can change the default value of the queue size? Let's say from 50 to 100.

/var/lib/ziti-controller/config.yml

events:
  amqpLogger:
    subscriptions:
      - type: fabric.usage
        version: 3
    handler:
      format: json
      type: amqp
      url: amqp://usr:pwd@host.name:port
      queue: events

I cannot find the parameter you mentioned in the events configuration section.

events:
  usageLogger:
    subscriptions:
      - type: fabric.usage
        interval: 5s
    handler:
      type: amqp
      format: json
      url: "amqp://localhost:5672" 
      queue: ziti
      durable: true      # default: true
      autoDelete: false  # default: false
      exclusive: false   # default: false
      noWait: false      # default: false
      bufferSize: 50     # default: 50
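Applied to the amqpLogger handler from the earlier post, raising it to 100 would look like this (host and credentials are the placeholders from that post):

events:
  amqpLogger:
    subscriptions:
      - type: fabric.usage
        version: 3
    handler:
      format: json
      type: amqp
      url: amqp://usr:pwd@host.name:port
      queue: events
      bufferSize: 100   # raised from the default of 50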

Does it make sense to increase the heap threshold at which GC runs? If so, how can I do that?

heap-in-use: 58.71MB (61562880 bytes)
next-gc: when heap-alloc >= 76.70MB (80427482 bytes)
ziti agent memstats
alloc: 38.11MB (39964080 bytes)
total-alloc: 8.87GB (9523829352 bytes)
sys: 134.74MB (141287704 bytes)
lookups: 0
mallocs: 229299369
frees: 229004865
heap-alloc: 38.11MB (39964080 bytes)
heap-sys: 122.00MB (127926272 bytes)
heap-idle: 63.29MB (66363392 bytes)
heap-in-use: 58.71MB (61562880 bytes)
heap-released: 54.42MB (57065472 bytes)
heap-objects: 294504
stack-in-use: 2.00MB (2097152 bytes)
stack-sys: 2.00MB (2097152 bytes)
stack-mspan-inuse: 1018.91KB (1043360 bytes)
stack-mspan-sys: 1.77MB (1860480 bytes)
stack-mcache-inuse: 2.36KB (2416 bytes)
stack-mcache-sys: 15.34KB (15704 bytes)
other-sys: 813.40KB (832926 bytes)
gc-sys: 5.21MB (5458504 bytes)
next-gc: when heap-alloc >= 76.70MB (80427482 bytes)
last-gc: 2025-08-01 16:09:48.146731178 +0000 UTC
gc-pause-total: 66.978441ms
gc-pause: 327296
num-gc: 288
enable-gc: true
debug-gc: false
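Since ziti is a Go binary, the point at which GC runs is governed by the standard Go runtime knobs rather than by ziti itself: GOGC sets how far the heap may grow past live data before a collection (default 100%), and GOMEMLIMIT (Go 1.19+) sets a soft memory ceiling. A sketch for a systemd deployment; the drop-in path and unit name are assumptions:

# /etc/systemd/system/ziti-controller.service.d/gc.conf  (hypothetical unit name)
[Service]
# collect once the heap grows 50% past live data instead of the 100% default
Environment=GOGC=50
# soft ceiling so the runtime returns memory before the OOM killer steps in
Environment=GOMEMLIMIT=700MiB

If the ziti agent exposes gops' setgc command, GOGC can even be adjusted at runtime (an assumption to verify with ziti agent --help). Note that raising the GC threshold would make the controller use more memory, not less; on a host with under 1 GiB of RAM you more likely want collections to happen sooner.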

Probably the memory is filled up by an unbounded number of api-sessions (left behind by zrok agent remoting). Is there a way to purge these sessions?
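For inspection, the CLI can at least show what has accumulated; deleting api-sessions individually through the management REST API is an assumption to verify against your controller version (the host, port, and $ZT_SESSION token are placeholders):

# list accumulated api-sessions
ziti edge list api-sessions

# hypothetical: delete a single api-session via the management API
curl -sk -H "zt-session: $ZT_SESSION" -X DELETE \
  "https://ctrl.example.com:1280/edge/management/v1/api-sessions/$API_SESSION_ID"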

I have come to the conclusion that the ziti-controller was overloaded by the network traffic generated by the zrok-controllers (deployed in an array, 7 in total). As the number of api-sessions grows, at some point the ziti-controller is no longer able to handle all the requests; it runs out of CPU and memory.

These api-sessions accumulated over a number of days, so their count became very large.

Are the api-sessions being kept alive? If they're not being actively used but the expiration time is too long, you can try setting the sessionTimeout to a shorter value; see: Controller Configuration Reference | OpenZiti
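A sketch of where that knob lives in the controller config, per the configuration reference (verify the exact key against your version):

edge:
  api:
    # api-session idle timeout; defaults to 30m, lower it to expire
    # idle api-sessions sooner
    sessionTimeout: 30m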

How big of a user community are you hosting that requires 7 zrok controllers?

I think 30 mins is fine.

The zrok agent-remoting api-sessions live as long as a zrok-controller runs, i.e. indefinitely.

So after installing zrok 1.0.6, I started to use agent remoting. Over several days the number of api-sessions grew steadily. Then the ziti-controller hit its limit and was killed.

Then I started to observe the number of sessions. I saw the traffic grow and the CPU usage increase over the days. I found that agent-remoting api-sessions never die out.

The only way to remove these api-sessions is to restart the zrok-controller.

After restarting the zrok-controller, ziti's session timeout performs the necessary cleanup.

The community is tiny. I placed the zrok-controllers in EU and non-EU countries. Some people have no access to EU data centers; for some there is even a problem accessing servers in their own home countries. It is a race against time. I hope to install zrok in HA mode just before their regulator cuts them off from my ziti-controller. The idea is to place a slave node in their home country and the master node in the EU, where I can easily access it from home to make changes.

But 7 controllers are able to generate a lot of api-sessions and kill the ziti-controller at some point.

Importantly, these api-sessions never die out. They stay alive until I reboot the zrok-controllers.