I've been testing the online status (hasEdgeRouterConnection) of my ziti-edge-tunnel network clients. During testing I noticed that when my devices are abruptly disconnected from the network or lose power, it takes about 30 minutes for their online status to update in the ZAC. How does Ziti determine whether devices (ziti-edge-tunnel clients) are online or offline?
Is there a timeout setting that can be configured on the controller to adjust this?
-> I have found the sessionTimeout setting, but this seems to apply to the API session between the respective client (ziti-edge-tunnel) and the Ziti controller. Does this timeout also affect the hasEdgeRouterConnection status?
I thought for sure I had replied here but I don't see my response... 30 minutes is the default session timeout. So the "has api session" bubble showing up for 30 minutes after an abrupt outage makes sense to me. At the time of the outage you should lose "edge router connected" though.
The "online-ness" of identities has been coming up a lot lately and I expect we'll be making changes to it in the near future. Until then, on a power outage I would expect you to see "has api session" for as long as the configured session timeout. In your controller config, find edge.api.sessionTimeout and change it from 30m to something else if you want. I don't really think you should, just because clustered controllers might change it too.
If I were to guess, it sounds to me like we are hearing from the community that more precise information regarding the overall health of the device is what people are actually desiring. It would probably be helpful overall if you could let us know what the exact use case is you're looking to solve as it might inform our responses here and future direction regarding these indicators.
Ok, so if I get that right, I can only influence the hasApiSession flag by setting edge.api.sessionTimeout, but not the hasEdgeRouterConnection one?
So in case of a power outage of the network client (e.g. ziti-edge-tunnel or sdk-embedded app) the hasApiSession value won't change until the timeout of (by default) 30min has passed.
How should the hasEdgeRouterConnection flag be interpreted then? As I understand it, or at least I assume, the routers maintain a connection to the clients using some sort of heartbeat, so they should actually recognize rather quickly that a client has gone offline. But it seems that this is not what the hasEdgeRouterConnection property reflects.
I would flip this, but admittedly, I don't know for sure the mechanism here as I didn't work on it. I would think the clients maintain connections to routers and are responsible for reconnecting. If a router stops hearing from a client it is connected to, I would expect that after some small amount of time it would report to the controller that the client no longer has a connection to that router. Once all routers the client was connected to have reported to the controller that it's no longer connected, I'd expect the controller to mark it as having no router connections.
I would expect a process suddenly terminating to reflect in the ZAC as having an api session for 30 minutes, and not having a router connection after "a few moments" (maybe like a minute?)
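As a rough illustration of the expectation described above, here is a toy model of the two flags. It is only a sketch under stated assumptions, not the controller's actual code: the class, its fields, and its method names are all hypothetical, and the 30-minute value is simply the default session timeout mentioned in this thread.

```python
from dataclasses import dataclass, field

# Assumption: the default edge.api.sessionTimeout of 30m mentioned in this thread.
SESSION_TIMEOUT_S = 30 * 60


@dataclass
class IdentityStatus:
    """Hypothetical model of how the two status flags might be derived."""

    # Time (in seconds) of the identity's last API activity.
    last_api_activity_s: float = 0.0
    # Routers that currently report a connection from this identity.
    connected_routers: set = field(default_factory=set)

    def has_api_session(self, now_s: float) -> bool:
        # The API session only lapses once the timeout has elapsed,
        # no matter how the client went away.
        return now_s - self.last_api_activity_s < SESSION_TIMEOUT_S

    def router_reports_disconnect(self, router: str) -> None:
        # A router tells the controller that the client is gone.
        self.connected_routers.discard(router)

    def has_edge_router_connection(self) -> bool:
        # Offline only once every router has reported the disconnect.
        return bool(self.connected_routers)
```

For example, one minute after a client connected to two routers disappears and both routers have reported it:

```python
status = IdentityStatus(last_api_activity_s=0.0,
                        connected_routers={"router-a", "router-b"})
status.router_reports_disconnect("router-a")
status.router_reports_disconnect("router-b")
print(status.has_api_session(60.0), status.has_edge_router_connection())
# prints: True False
```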
You say that doesn't seem to be what you're seeing, but that's what happened for me in my test. Maybe you're doing something I'm not?
I used the ziti cli to start a 'server' that binds a service:
ziti ops verify traffic --prefix connection_test --mode server
Well, this one is a "recent build from source" so I don't know the exact version, but I'm usually running "pretty new/recent" deployments. I'd guess this is 1.4+ for sure, it might be 1.5+... "pretty new"... However, I doubt this functionality has changed much recently (I reserve the right to be wrong).
The 1.2 release was focused on changing how online status was calculated. Instead of API session heartbeats, it's now focused on connect/sdk events. There's a controller config flag which indicates if the system should use api session heartbeats, connect events or both to manage the online status.
Oooook so, I was now testing with controller/router V1.4.3 and ziti-edge-tunnel V1.5.4
I have now replicated your test and got the same result: the hasEdgeRouterConnection bubble went grey ~15 seconds after killing the Ziti process within my WSL. Additionally, I tested the same on a different machine, a lower-power device also running Linux, and had a similar result there. In this case, it took ~40 seconds until the hasEdgeRouterConnection bubble went grey after killing the Ziti process on this machine.
But when I did the same test and cut off power to the device instead of killing the process, it took ~10 minutes until the routers recognized that this machine was really offline: the hasEdgeRouterConnection bubble only turned grey 10 minutes after the power cut.
I noticed the same behavior when testing within AWS by running an EC2 instance, installing ziti-edge-tunnel (V1.5.4), and then blocking all traffic using NACLs. Here, it also took roughly 10 minutes until hasEdgeRouterConnection changed from online/green to offline/grey.
I think what is happening here (and I am not a networking expert) is that when we kill the process under Linux, the kernel closes the socket that was owned by the now-killed process and sends a TCP FIN (?) to the remote end, which is a Ziti router. But when the power is cut, the network cable is unplugged, or the network just suddenly stops working, the Linux kernel obviously cannot send that TCP connection termination, and thus the Ziti routers do not recognize (at least not within a short amount of time) that a client has gone offline.
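The difference between the two cases can be reproduced with plain sockets. The sketch below is generic socket behavior, not Ziti code, and the function names are made up for illustration. It uses a local socket pair: in the first function one end is closed cleanly (what the kernel does when a process is killed), in the second the peer simply goes quiet (standing in for a power cut or blocked traffic).

```python
import socket


def peer_close_is_visible() -> bytes:
    """Close one end cleanly and read from the other end."""
    a, b = socket.socketpair()
    try:
        a.close()          # clean close: the kernel signals end-of-stream
        return b.recv(1)   # returns b"" at once, so the peer is known to be gone
    finally:
        b.close()


def silent_peer_looks_alive(wait_seconds: float = 0.2) -> bool:
    """Return True if a silent peer cannot be told apart from an idle one."""
    a, b = socket.socketpair()
    try:
        b.settimeout(wait_seconds)
        try:
            b.recv(1)
        except socket.timeout:
            # No data and no close notification: the connection still looks open.
            return True
        return False
    finally:
        a.close()
        b.close()
```

With a clean close the reader learns immediately that the peer is gone. With a silent peer nothing arrives at all, so the remote side can only find out through its own timeout or some kind of heartbeat, which would be consistent with the much longer delay seen after a power cut.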
I saw that in the release notes, but I could not find where to set the setting described. The controller config section of the Ziti docs does not mention the identityStatusConfig setting.
Also, I am running the whole setup in K8s and the Helm chart does not allow specifying that setting either. But if you tell me where I can set it, I can test it by manually changing the respective ConfigMap in K8s.
You may append arbitrary, additional configuration to the edge section of the Kubernetes controller's ConfigMap by defining the Helm chart's input value .Values.additionalConfigs.edge (a dict).
Just tested adding these options and repeated my test within AWS (EC2 instances with ziti-edge-tunnel, blocking all traffic using NACLs) and experienced the exact same behavior as before. That is, it still took ~10 minutes until the hasEdgeRouterConnection property/bubble went false/grey, even with the options set.
Can you confirm the Helm release was upgraded with the new values, just to ensure the template is placing the controller configuration directives in the expected location?
You reported your controller version is 1.4.3, and the new config directive was introduced in 1.2.0, so we just need to verify the controller is configured correctly.
kubectl get configmap ziti-controller-config --namespace=ziti --output=go-template='{{index .data "ziti-controller.yaml" }}'
Paging @plorenz back in because we proved the controller is now configured with the new directives in edge.identityStatusConfig, but the symptoms persist, right @janst? I didn't fully comprehend the problem, but wanted to remove any friction associated with the Helm chart, at least.