We’ve just today discovered a problem that I believe is caused by wildcard services.
After creating the service as described by @scareything, we noticed a lot of "circuit idle threshold exceeded" logs. A couple of minutes after creation it was approximately 4,000/minute and rising pretty quickly. As far as I know we didn't do anything special after creating the service, like renaming it.
A restart of the controller temporarily fixed the problem.
However this morning we ended up with a couple hundred thousand logs and significant impact on the whole fabric. We also observed that the circuitCount showed ~18000.
Any ideas on what this could be caused by? What additional info can I provide? Are there logs that associate a circuit to a service?
Running the latest Ziti, v0.31.0.
Yikes! It's a light week here for the OpenZiti team fyi. I'd start by collecting the logs from the controller and the routers if you can. You can email them to help at openziti.org or share them some other way if you feel comfortable with that. If not, does removing the service help? Can you put it back and recreate the problem?
It always helps developers tremendously if you can reliably reproduce the problem; please include the steps if you have them.
Thanks. I've mailed the logs.
The problem went away after deleting a service, but not the wildcard service I suspected, which makes it even more unclear to me. Gosh, I hate problems I can't reproduce.
Maybe you'll find something in the logs, or I'll find a way to reproduce it in the future.
One thing I've just thought of:
I created two services: one wildcard "many to one" service and one non-wildcard service to allow traffic in the other direction. I've just deleted the non-wildcard service. Both services had the same role attribute. Is this a problem?
What exactly does "circuit exceeds idle threshold" mean? No usage of the circuit? What's the result of the warning? Is the circuit closed afterwards? How is the threshold set?
You can see pretty clearly how the volume of logs builds up steadily:
That kind of dial rate usually indicates you are capturing your own traffic. That is, the same device is hosting and dialing the service. The loop forms as the traffic is initiated, intercepted, emitted by the same device (hosting), recaptured; wash, rinse, repeat. The best way to start looking is the policy advisor. For the services in question, run the policy advisor and see if any device has both Bind and Dial permissions; that is usually an indicator of the problem config. You can have valid configs like that, especially with addressable terminators, but it isn't very common.
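For instance, with the ziti CLI (a minimal sketch; the service name wildcard.ssh is illustrative, and the -q flag, which trims the explanatory output, is our assumption about recent versions):

# list Dial/Bind permissions per identity for one service
ziti edge policy-advisor services wildcard.ssh -q

Any identity that shows both Dial: Y and Bind: Y for the same service is the first thing to check.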
If it isn't that, it would seem that you have something not closing circuits properly. Looking at the actual traffic that is being intercepted, you might see a particular application that doesn't close connections well. The idle circuits will accrete, and the warnings are emitted once per minute, so they will rise over time even if the rate of creation is linear: the circuits don't age out and are re-logged each minute they remain.
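One way to check for that from the hosting side is to look for TCP connections stuck in half-closed states. This is plain Linux tooling, not a ziti command, and the port number is illustrative:

# connections lingering in CLOSE-WAIT usually mean the local
# application never closed its side after the peer hung up
ss -tan state close-wait

# watch everything on the intercepted service port
ss -tan '( sport = :10050 or dport = :10050 )'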
Hi @dmuensterer ,
Idle circuits are circuits which are established but haven't had any traffic flowing on them. When we detect these, we check with the controller to make sure they are still valid. If the controller doesn't know about them, it will let the router know they can be cleaned up.
We don't currently close them, as there are some use cases where users have long lived connections that only send traffic very sporadically.
Orphan circuits happen sometimes when circuits are re-routed, as the old path isn't cleaned up immediately; rather, we let this GC process handle the cleanup.
The settings are configurable in the router config file.
forwarder:
  # values specified in milliseconds
  idleCircuitTimeout: 60000   # how long to wait before checking with the controller
  idleTxInterval: 60000       # how often to scan for idle circuits
If this turns out not to be the issue that @mike.gorman highlighted as a possibility (of the tunneler intercepting its own traffic), let us know and we'll dig into it further. If you can make it happen reliably, we should be able to track it down.
If you run a circuit inspect, that will let us know what the routers know about the circuit and will help us narrow down where the issue is. We should be able to tell if the routers have a valid circuit (which would indicate a problem with the tunneler or external application(s)) or if some close notifications are getting lost and the problem is in the fabric.
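Something like the following should do it (a sketch; ziti fabric list circuits is a standard command, while the circuit:&lt;circuit-id&gt; inspect target is our assumption about recent versions, with the id as a placeholder):

# list active circuits and note the id of a suspect one
ziti fabric list circuits

# ask the controller/routers what they know about that circuit
ziti fabric inspect circuit:<circuit-id>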
OKAY : zabbix.ziti (2) -> wildcard.ssh (2) Common Routers: (2/2) Dial: Y Bind: Y
And so easy to troubleshoot if you know where to look
That might be something where an alert could be displayed in ZAC, even though there might be edge cases where this is valid?
Any ideas?
Maybe one more hint on the setup to recreate:
There's one wildcard service that allows SSH access from #admin to #ssh
There's another wildcard service that allows TCP/10050 from @zabbix.ziti to #zabbix_agent. @zabbix.ziti also has the attribute #ssh.
Is this a problem? Looks like the two wildcard services somehow interfere with each other?
Hi, I've found a way to reproduce it by using 2 wildcard services with the following 3 identities and attributes.
And a small explanation of what I'm trying to achieve (see the sketch after this list):
- #admin should be able to dial all *.ziti addresses to access port 22.
- #ssh should let out the traffic on port 22 dialed by #admin.
- @Monitoring should be able to dial all *.ziti addresses to access port 10050.
- #monitoring_client should let out the traffic on port 10050 dialed by @Monitoring.
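For reference, the SSH side of that setup might be created roughly like this (a sketch, assuming the standard intercept.v1/host.v1 configs; all names are illustrative and the #admin/#ssh attributes mirror the description above):

# intercept config: #admin identities capture *.ziti on port 22
ziti edge create config wildcard-ssh-intercept intercept.v1 \
  '{"protocols":["tcp"],"addresses":["*.ziti"],"portRanges":[{"low":22,"high":22}]}'

# host config: #ssh identities forward the dialed address on port 22
ziti edge create config wildcard-ssh-host host.v1 \
  '{"protocol":"tcp","forwardAddress":true,"allowedAddresses":["*.ziti"],"port":22}'

ziti edge create service wildcard.ssh --configs wildcard-ssh-intercept,wildcard-ssh-host

ziti edge create service-policy wildcard-ssh-dial Dial \
  --service-roles '@wildcard.ssh' --identity-roles '#admin'
ziti edge create service-policy wildcard-ssh-bind Bind \
  --service-roles '@wildcard.ssh' --identity-roles '#ssh'

The monitoring service would look the same with port 10050 and the @Monitoring/#monitoring_client roles.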
Hi @dmuensterer
The circuit inspects show a complete, valid circuit in the fabric, so the problem is likely either with the tunnelers or external (or you've got a loop problem). Some applications don't close cleanly, and that can leave dangling circuits.
If you've got a loop issue, you'll see more circuits than you expect. If the number of circuits that's building up looks like it corresponds with the number of connections that have been made, then it's probably a problem with incomplete closes.
We have seen similar issues elsewhere and we're working on the following to mitigate this:
- Adding a configurable keepalive in the tunnelers (probably in the service config), so that if there's an incomplete close we detect it.
- Adding a configurable, per-service max-idle time, so that idle circuits can get cleaned up automatically.
I did some more digging and it indeed seems like Zabbix is the source of the problem. Zabbix is a monitoring system and we're trying to establish a connection from the monitoring server to the agents.
What exactly do you mean by that? Open TCP connections? I'm also just trying to understand what an application can do to prevent the problem.