If all identities are going offline, my expectation would be the controller and/or routers were all disconnected from the internet.
You state you're using 'GKE spot instances'. My immediate inclination is that these instances are getting rate limited for CPU, taken offline, paused or something along these lines. This feels highly likely to be the problem to me.
NetFoundry operates hundreds of OpenZiti networks, and to my knowledge we've never seen this particular behavior. We have definitely seen cloud providers throttle or pause instances before, though. That sounds like what's happening here.
OK, then I'll probably switch the controllers from spot instances to standard nodes and keep monitoring.
So in general, it's not recommended to run on spot instances?
"Generally not" would be my response. But I mean, as long as they work for you, I'd say go for it. If you can tolerate these 1-2 minutes of random disconnectedness and things are fine afterwards, it sounds like they are working "well enough" for you. If you can't tolerate that, then I'd suggest moving to something that has more clear and discreet guarantees.
This particular issue is often a bit tricky to track down in practice, just because it looks like everything "just stops" for a minute. Nothing in the logs, etc. That makes debugging the issue difficult. It sounds like that's the issue you're hitting, and it can be quite a frustrating thing to track down. I would definitely recommend you take that out of the equation, but from the description this doesn't sound like anything I've seen or heard about that would be OpenZiti-related.
I've moved to standard nodes and it looks OK now, but on my IoT device running the edge tunnel I still get the error below; the node disconnects for a few seconds and then comes back online.
Logs:
Mar 17 01:00:01 aly-gw-1 ziti-edge-tunnel[3857600]: About to run tunnel service... ziti-edge-tunnel
Mar 17 11:41:08 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 38466.984] ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:192.168.0.223:52043 err=-14, terminating connection
Mar 17 11:41:08 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 38466.993] WARN ziti-sdk:channel.c:553 dispatch_message() ch[0] received message without conn_id or for unknown connection ct[ED72] conn_id[39]
Mar 17 11:44:58 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 38697.605] ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:192.168.0.223:54051 err=-14, terminating connection
Mar 17 11:44:58 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 38697.606] WARN ziti-sdk:channel.c:553 dispatch_message() ch[0] received message without conn_id or for unknown connection ct[ED72] conn_id[53]
Mar 17 13:35:25 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45324.032] WARN ziti-sdk:bind.c:463 on_message() binding failed: -17/ziti edge router is not available
Mar 17 13:35:25 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45324.032] WARN ziti-sdk:bind.c:463 on_message() binding failed: -17/ziti edge router is not available
Mar 17 13:35:25 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45324.032] WARN ziti-sdk:bind.c:463 on_message() binding failed: -17/ziti edge router is not available
Mar 17 13:35:25 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45324.032] WARN ziti-sdk:bind.c:463 on_message() binding failed: -17/ziti edge router is not available
Mar 17 13:35:25 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45324.032] WARN ziti-sdk:bind.c:463 on_message() binding failed: -17/ziti edge router is not available
Mar 17 13:35:25 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45324.032] WARN ziti-sdk:bind.c:463 on_message() binding failed: -17/ziti edge router is not available
Mar 17 13:35:25 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45324.032] WARN ziti-sdk:bind.c:463 on_message() binding failed: -17/ziti edge router is not available
Mar 17 13:35:25 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45324.032] WARN ziti-sdk:bind.c:463 on_message() binding failed: -17/ziti edge router is not available
Mar 17 13:35:25 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45324.032] WARN ziti-sdk:bind.c:463 on_message() binding failed: -17/ziti edge router is not available
Mar 17 13:35:25 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45324.032] WARN ziti-sdk:bind.c:463 on_message() binding failed: -17/ziti edge router is not available
Mar 17 13:36:17 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45376.003] ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:60716 err=-13, terminating connection
Mar 17 13:36:50 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45409.028] ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:60716 err=-13, terminating connection
Mar 17 13:39:00 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45539.027] ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:45920 err=-13, terminating connection
Mar 17 13:39:31 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45570.039] ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:45920 err=-13, terminating connection
Mar 17 13:41:49 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45708.054] ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:60354 err=-13, terminating connection
Mar 17 13:44:08 aly-gw-1 ziti-edge-tunnel[3857600]: (3857600)[ 45847.208] ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:46668 err=-13, terminating connection
I'm running the Ziti controller and router on a GKE private cluster.
This is my forwarder configuration on the router; do I need to modify it?
forwarder:
  latencyProbeInterval: 10
  linkDialQueueLength: 1000
  linkDialWorkerCount: 32
  rateLimitedQueueLength: 5000
  rateLimitedWorkerCount: 64
  xgressDialQueueLength: 1000
  xgressDialWorkerCount: 128
Not really. It's entirely dependent on how many clients there are, how much data they push, and what their communication profiles look like.
I would recommend you monitor CPU and network. If either of these is high, it's time to add another router, more CPU, etc. Generally speaking, though, I would expect 2 CPUs and 8 GB of RAM to go a long way for starters.
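As a rough sketch of the kind of host-level check I mean (the load-versus-core-count threshold is an assumption; `kubectl top` or your cloud provider's monitoring works just as well), something like this on the router node:

```shell
# Sketch: warn when the 1-minute load average exceeds the CPU count.
# The threshold is an assumption; adapt it to your own monitoring stack.
cores=$(nproc)
load=$(cut -d' ' -f1 /proc/loadavg)
high=$(awk -v l="$load" -v c="$cores" 'BEGIN { print ((l > c) ? 1 : 0) }')
if [ "$high" -eq 1 ]; then
  echo "CPU pressure: load $load on $cores cores -- consider another router or more CPU"
else
  echo "CPU ok: load $load on $cores cores"
fi
```

Wiring the same check into an alert is usually better than eyeballing it, since these stalls only last a minute or two.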
Operating a network at scale is a challenge. If you're under 50 identities, though (50 is just an arbitrary, modest number I chose), there must be something more fundamentally wrong with the underlying VM/hardware itself. This isn't something I've seen with any of the networks NetFoundry provides. I also haven't ever seen it on the networks I run, and I use all the default values.
I think you still have something external to OpenZiti causing you problems. That's what it seems like to me if you're getting that error intermittently. This seems to me to be a monitoring challenge.
I believe I've identified the root cause: one of my main production Edge Routers on AWS EC2 had a full disk, which prevented it from writing syslogs. This likely triggered repeated connection failures from all identities, eventually causing intermittent issues with the Ziti controller and the public Edge Router.
I've cleared the old logs and brought the volume usage down from 100% to 45%. So far, there haven't been any dropped connections. I'll continue monitoring to ensure stability.
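For anyone hitting the same failure mode, a minimal disk-usage guard can catch this before the router starts failing. The 90% threshold and the "/" mount point here are assumptions; point it at whichever volume holds your logs:

```shell
# Sketch: warn when a volume passes a usage threshold.
# The 90% threshold and "/" mount point are assumptions; adjust for your volumes.
threshold=90
usage=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$usage" -ge "$threshold" ]; then
  echo "WARN: / is ${usage}% full -- syslog writes may start failing"
else
  echo "OK: / is ${usage}% full"
fi
```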
@qrkourier one quick question: if I installed the router using a JWT file during the initial setup with helm upgrade --install, do I need to use the same JWT file again when performing an upgrade?
I assume the JWT was only required for the initial enrollment, am I right?
Also, could you explain how helm upgrade works in this context?
I ask because I only started getting this intermittent issue after I upgraded the router a couple of months ago from 0.36 to 1.1.5.
I just passed the old JWT file as a placeholder to upgrade the router via Helm.
Do you think that could cause any issues?
No, there's no problem supplying a dummy value for enrollmentJwt during Helm release upgrade operations. You will only need to supply a valid enrollment token during initial or re-enrollment.
The value is used to set an environment variable for the router which is only used for enrollment, so an empty value will simply be ignored if the router's PVC already contains an identity certificate obtained during enrollment.
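To make that concrete, here is a hedged sketch of the relevant values-file entry (the key name follows the enrollmentJwt value discussed above; the exact layout may differ across chart versions):

```yaml
# values.yaml sketch -- key name per the enrollmentJwt value discussed above;
# your chart version's layout may differ.
enrollmentJwt: ""   # only needed for initial or re-enrollment; ignored on
                    # upgrades once the PVC holds the enrolled identity cert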
Did you encounter an error when enrollmentJwt was empty or undefined? In my testing just now, it was only required for enrollment and it was unnecessary to supply any dummy values for subsequent upgrade operations that did not entail re-enrollment.
OK, yeah, I didn't have any issue while it was running; I'm just trying to trace down my problem.
As a last option, since I'm running the edge tunnel in GKE as a DaemonSet, I'll probably switch to a reverse-proxy pod and check whether that is more stable.
Not sure what I'm missing; it's tough to trace the issue.
What's the symptom of the problem you encountered intermittently?
I assume the symptom correlates with your log messages about gke-fabric-router being unavailable.
Some router unavailable messages could point to a condition where an edge router is advertising a listener that is not reachable, which may be harmless as long as at least one edge listener is reachable.
However, an edge router that was in use and became unavailable seems to point toward that router's deployment being terminated or unpublished in some way. I'd check a correlated time frame from that edge router's log to learn if it encountered any internal error, and look for evidence that the router's ports were available or not during that time frame.
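One way to eyeball that correlation, assuming you've exported the relevant journal window to a file (the sample log below just reuses the format from the logs earlier in this thread; substitute your real export):

```shell
# Sketch: bucket "router is not available" warnings by minute from an exported log.
# /tmp/sample.log stands in for your real exported journal window.
cat > /tmp/sample.log <<'EOF'
Mar 17 13:35:25 aly-gw-1 ziti-edge-tunnel[3857600]: WARN binding failed: -17/ziti edge router is not available
Mar 17 13:35:25 aly-gw-1 ziti-edge-tunnel[3857600]: WARN binding failed: -17/ziti edge router is not available
Mar 17 13:36:17 aly-gw-1 ziti-edge-tunnel[3857600]: ERROR on_tcp_client_err() err=-13
EOF
counts=$(grep 'router is not available' /tmp/sample.log \
  | awk '{print $1, $2, substr($3, 1, 5)}' | sort | uniq -c)
echo "$counts"
```

Spikes in one minute-bucket that line up with the router's own restart or error messages would support the "router deployment terminated or unpublished" theory.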
Are you able to trigger the failure mode reliably?
My controller and router are running on standard nodes.
I'm not sure where to troubleshoot to make it more stable.
I'm running only one router, exposed publicly on a URL.
Is there anything I need to increase, like a timeout or something along those lines?
One more weird thing: I was using version 0.36 earlier, then I migrated to v1.1.15 and moved to spot instances in October 2024. After that I started getting this issue very frequently, and I thought it was related to the spot instances, but I wonder if there is any issue with the 1.1.15 version? Do you want me to upgrade and check?