I've been playing with OpenZiti to lock down my management network and I have it configured as such:
Windows Laptop (Windows 10) uses ZDE from my "Trusted VLAN" > Connect to my internal reverse proxy (HAProxy) which does SSL termination and gives things a nice domain name (app.int.domain.net) > Service (in different vlans).
I have it so my reverse proxy blocks IPs other than OpenZiti's IP (the edge router binds the service right now), and if I access a service outside of the Ziti tunnel I get access denied, which is great. But when I connect to the Ziti tunnel only SOME services work, and it seems consistent: for example, my Proxmox web interface will NOT work.
I need to wait a long time or keep disconnecting/reconnecting and disabling/re-enabling the identity and Ziti service until it starts working. It happens almost every time I wake my laptop up and need to start using it again. In contrast, if I use my Semaphore (Ansible web UI) service, which goes through the same reverse proxy with the same restrictions and is set up the same way in OpenZiti, it works right away even while Proxmox does not. Both domains exist in internal DNS as well as in an intercept. I've updated to the latest Ziti software thinking that might fix it, but it hasn't.
Any ideas on where I could look or how I could troubleshoot this further?
Interesting write up, thanks for the details. One thing caught my attention more than anything and it's "It happens almost every time I wake my laptop up and need to start using it again"
Here's what I would do.
I would:
stop my ZDEW data service aka ziti-edge-tunnel (the big green button on the ZDEW UI or net stop ziti)
stop the monitor service (using the services applet or net stop ziti-monitor)
quit the ui - Main Menu -> Quit Ziti Desktop Edge
clear all my logs (remove the entire C:\Program Files (x86)\NetFoundry Inc\Ziti Desktop Edge\logs directory)
open the UI
start the monitor service (using the services applet or net start ziti-monitor)
start the ZDEW data service aka ziti-edge-tunnel (the big green button on the ZDEW UI or net start ziti)
change to DEBUG level logging
reproduce the issue
from the UI capture Main Menu -> Feedback
The last step produces a feedback zip file containing diagnostic data along with the logs. If you are comfortable doing so, send that feedback zip file to me via email (clint at openziti.org) and I'll take a look at it.
I would open the ziti-edge-tunnel logs found in the service directory and start looking for the service you're having troubles with in there. I would make sure you see log messages such as:
[2025-03-18T01:07:11.666Z] INFO tunnel-cbs:ziti_dns.c:566 format_resp() found record[100.100.0.2] for query[1:mattermost.tools.netfoundry.io]
and
[2025-03-18T20:05:23.774Z] DEBUG ziti-sdk:connect.c:553 process_connect() conn[0.2/I2aUZ1s7/Connecting](mattermost.tools.netfoundry.io) starting Dial connection for service[mattermost.tools.netfoundry.io] with session[cm8ex6bxnm1d1kjp58pbdl57t]
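If the log is large, a small script can do that first pass for you. The following is only a rough sketch: the two regular expressions are modelled on the two sample lines quoted above and nothing else, and the `summarize` helper is a made-up name, so adjust it if your log lines differ.

```python
import re
from typing import Iterable

# Patterns modelled on the two sample log lines quoted above (an assumption:
# other tunneler versions may word these messages differently).
DNS_HIT = re.compile(r"found record\[(?P<ip>[^\]]+)\] for query\[\d+:(?P<host>[^\]]+)\]")
DIAL_START = re.compile(r"starting Dial connection for service\[(?P<svc>[^\]]+)\]")


def summarize(lines: Iterable[str], service: str) -> dict:
    """Count DNS hits and dial attempts for one service.

    Any other line that mentions the service is collected under "other",
    which is where errors following a DNS hit would show up.
    """
    summary = {"dns_hits": 0, "dials": 0, "other": []}
    for line in lines:
        if service not in line:
            continue
        dns = DNS_HIT.search(line)
        dial = DIAL_START.search(line)
        if dns and dns.group("host") == service:
            summary["dns_hits"] += 1
        elif dial and dial.group("svc") == service:
            summary["dials"] += 1
        else:
            summary["other"].append(line.rstrip())
    return summary
```

To use it, open the ziti-edge-tunnel log file, pass the file object and the intercepted hostname to `summarize`, and print the result. DNS hits with no matching dials, or dials followed by entries in `other`, are the lines worth reading closely.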
Thank you for your quick response. I've managed to reproduce it (though it wouldn't happen if my laptop went for a "short" sleep; I noticed it had to be sleeping longer. Perhaps a network timeout?).
I followed your instructions and sent the file to you via email. When I was able to reproduce it, I noticed the issue mentioned in another thread here, where the GUI shows everything is connected and no login is needed (no IDP icon), yet you cannot connect to services. After disabling and re-enabling the identity I could get to some services but not to pve.int.x.x (domain blanked out for privacy on the public forum here). I noticed there's an error a bit after finding the record, saying "unexpected eof while reading".
Either way perhaps these logs will help fix both of these issues?
Mostly correct, yes. It's interesting that you say you should not need to login (I assumed you did need to, and that the real issue is that the app did not detect this... But apparently there's two issues combined in that it doesn't detect that the access is no longer working AND there's a break in the connection despite Authentik having the offline_access scope).
Time: I haven't measured how much time it takes but 5-10 minutes would be around the "long" time I waited yes... My guess is if you close the lid on the laptop that it may not fully go to sleep for a short period in the event you change your mind.
To add to the end of your list... After you re-enable the identity and then login to the IDP again, proxmox was not accessible still despite other services being accessible still (e.g. zabbix, semaphore) which was the original issue here until you helped me find more
More testing: it seems related to my laptop's network connection. I disabled the Wi-Fi connection (which would also happen during sleep if sleeping long enough) and turned it back on after 2 minutes. Ziti thinks the connection is live (no graphical indication of needing to reconnect or authenticate), but I get "ERR_CONNECTION_REFUSED" in Chrome. If I ping the PVE intercept I still get the 100.64.0.10 IP address, and it responds to ping. I then disabled and re-enabled the identity, which required re-authenticating to my IDP, and Proxmox worked (this time). I'll do more testing tomorrow to see how consistent it is.
Ok, so on further testing, it seems to be happening less often than before (maybe because of the offline_access scope change I made the other day), but it still occurs fairly frequently if I disable the connection or have the laptop sleep for a while, then re-enable the Ziti service (power button), and then disable and re-enable it again. This seems to be a more certain way to trigger it. If you're patient, it will work after a few minutes, though. It's a tricky thing to troubleshoot.
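For anyone wanting to repeat this check without a browser, here is a rough Python sketch that separates "the name resolves to an intercept address" from "a TCP connection actually succeeds". It assumes the tunneler hands out intercept addresses from 100.64.0.0/10 (both 100.64.0.10 and the 100.100.0.2 seen in the sample log fall inside that range, but the range is configurable, so treat it as an assumption). The `probe` and `is_intercept_address` names are made up for illustration.

```python
import ipaddress
import socket

# Assumption: intercept addresses come from 100.64.0.0/10. Change this if
# your tunneler is configured with a different range.
ASSUMED_INTERCEPT_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_intercept_address(ip: str) -> bool:
    """True if the address falls inside the assumed intercept range."""
    return ipaddress.ip_address(ip) in ASSUMED_INTERCEPT_RANGE


def probe(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Resolve host, then try a TCP connect, reporting each result separately."""
    result = {"host": host, "ip": None, "intercepted": False,
              "tcp_ok": False, "error": None}
    try:
        result["ip"] = socket.gethostbyname(host)
        result["intercepted"] = is_intercept_address(result["ip"])
        with socket.create_connection((result["ip"], port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError as exc:
        # Covers DNS failures, refused connections, and timeouts.
        result["error"] = str(exc)
    return result
```

Calling `probe` with the Proxmox hostname and again with a service that works (Semaphore, say) right after a wake-up should show whether both are intercepted and only one fails at the TCP step, which matches the symptom described here.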
This behavior makes sense to me. I'll explain the whole flow at the end of this post to hopefully make it clear. In general, we expect most people to never need to turn off the service. I'm sure you have good reasons for disabling an identity/turning off the tunneler, but that is not how we currently operate.
Oh. Interesting. So you're saying it DOES recover after what a human (me or you) would consider "far too long" -- like minutes. This is also a helpful observation, thank you!
Do you know how long your refresh token is valid for? That could be the cause.
At this time, the tunnelers only store the refresh token in memory and that memory is erased when the user turns an identity off. This is why every time you turn off the service/the identity you are forced to reauthenticate. This behavior could change but we would need to implement it (aka -- more work) and it is a valid security concern since the refresh token could be used to obtain other tokens.
How It Works Right Now
This part of the post will describe how this all works. It's in the weeds a bit maybe so skip it if you're not interested in the "whys" and details.
User turns on a tunneler
tunneler tries to authenticate to the openziti overlay
overlay indicates the tunneler can use external authentication
tunneler knows the external providers configured and lets the user pick one to auth with
tunneler and user complete an Auth code with PKCE flow to the authorization server. This results in two or three tokens being retrieved from the authorization server: an ID token, an access token, and a refresh token. Not all authorization servers support refresh tokens or the offline_access scope, fwiw. Keep note of this refresh token, it's important later.
tunneler uses the token specified in the external jwt signer to authenticate to openziti overlay
user does tunneler things.... makes http requests etc. every request to dial requires a valid token from the authorization server
at some point the tunneler realizes the access token (or id token) is going to expire.
IF the tunneler has a refresh token from the authorization server, it's used to obtain a new access/id token from the authorization server
IF the tunneler could not obtain a new token the UI should change over to the "authorize idp" view and the user will need to authenticate to the idp again
the new id/access token is then used for new service dial requests (http, ssh, whatever)
rinse repeat this cycle until the refresh token expires
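The token part of that cycle can be sketched in a few lines of Python. To be clear, this is a toy model for illustration, not the tunneler's actual code: the `Identity` class, the `skew` value, and the `refresh` callback are all hypothetical names. It only shows the decisions described above: reuse the token while it is valid, refresh when it is about to expire, fall back to the IdP login when refreshing is not possible, and lose everything when the identity is turned off because the tokens are held in memory only.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tokens:
    access_expires_at: float       # epoch seconds
    refresh_token: Optional[str]   # None if the IdP issued no refresh token


class NeedsReauth(Exception):
    """The user has to go through the IdP login again."""


class Identity:
    """Hypothetical model of one identity's token handling."""

    def __init__(self, refresh: Callable[[str], Optional[Tokens]], skew: float = 30.0):
        self._refresh = refresh   # exchanges a refresh token for new tokens, None on failure
        self._skew = skew         # refresh this many seconds before expiry (assumed value)
        self.tokens: Optional[Tokens] = None

    def login(self, tokens: Tokens) -> None:
        # Result of a completed auth code + PKCE flow.
        self.tokens = tokens

    def disable(self) -> None:
        # Tokens are held in memory only, so turning the identity off drops them.
        self.tokens = None

    def ensure_valid(self, now: float) -> Tokens:
        """Return tokens usable for a dial at time `now`, refreshing if needed."""
        if self.tokens is None:
            raise NeedsReauth("no tokens in memory")
        if now < self.tokens.access_expires_at - self._skew:
            return self.tokens
        if self.tokens.refresh_token is None:
            self.tokens = None
            raise NeedsReauth("token expiring and no refresh token available")
        renewed = self._refresh(self.tokens.refresh_token)
        if renewed is None:
            self.tokens = None
            raise NeedsReauth("refresh failed")
        self.tokens = renewed
        return renewed
```

In this model a sleep or Wi-Fi drop on its own never forces a login as long as the refresh call succeeds afterwards, while `disable()` always does. That is the distinction the steps above are drawing.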
Ok, hope this helps and thanks so much for your insights and observations so far. I have not had a moment to be able to test/replicate this issue but I am trying to get to the point where i can. Cheers!
Apologies for the delay in replying - it's been busy for me in the last bit!
While this sounds great in theory... the truth is you have disconnection events, where a laptop is put to sleep or a PC is shutdown, a phone/tablet/pc restarts or loses connection. I also experience issues sometimes where my network connection totally breaks on android and I need to disable Ziti (and re-enable again).
So I say this and it still is occurring fairly often with my laptop. I checked and my provider settings for Open Ziti on Authentik are set to:
Given this, I would expect that if I were to turn off Ziti (or lose the connection, or disable the identity), I could re-establish the connection without needing to re-authenticate with my IDP for at least a few minutes, given the access token validity (or maybe the refresh token time?). I agree that making this window too long is a security concern, but an interruption of seconds or even a minute shouldn't cause a full re-auth event, I would think. I've seen other apps take this approach to make the user experience a little smoother.
Love this breakdown, thank you! I apologize if it's already been mentioned, but just having that IDP window open automatically when you enable an identity/the Ziti service/reconnect (after a reboot/wake up/connection restored), instead of having to open Ziti, find the IDP icon, and click it, would be a huge boost in user satisfaction. That is especially true if combined with the previously mentioned idea of not requiring a complete re-auth if a token is still valid (e.g. within seconds, a minute, or a user-defined value in the IDP, such as one of the token times that can easily be configured).
Appreciate your responses and your team's great work on this software!
I think maybe there's a misunderstanding. The OpenZiti tunneler is expected to tolerate all of these events and work properly, meaning that in practice it never needs to be turned off. All of the examples you cited should absolutely be covered in an OpenZiti tunneling experience and should work the way you (and I) expect. If they aren't (and it seems like there are times it doesn't), those are just bugs we need to find and fix, or docs we need to add to explain "why and how" you have to log in again (refresh tokens expiring, etc).
It's exceptionally common for refresh tokens to be long-lived like this. Given your scenario, I would expect you to have to authenticate "manually" every 30 days. It just seems like we need to find/fix a bug here.
That's not something I'd considered previously, but it sure seems like a decent feature request to me. Basically, if you have authenticated previously with an IdP, the UI would remember that and if the UI would show the "needs external auth" icon, I could see a setting that allows for the user to choose to "automatically initiate the flow". I can see that. I'll bring that up and put an issue in for it. Thanks for that idea.
Ah ok, so it definitely does seem to be an issue with my setup then. I've had issues with both the Windows tunneler and the Android one. They are similar in that it appears connected and the hostname resolves to the Ziti overlay IP, but I'm not able to access services. My Android thread is here: Issue with Android tunneler connectivity
I suppose the next step is figuring out exactly what is causing these issues. Is there any more detailed logging I should turn on to diagnose this? According to the Android thread @ekoby mentioned: "the router logs indicated some issue between controller and router. management connection was being reset frequently and router was not getting updates, which led to endpoints not being able to connect to the router". Given my router and controller run on the same LXC, I'm not sure why they would have these issues.
Should I look at the router or the controller for issues?
What logging can I turn up to understand the issues more and send over?
It does seem to be related to the connection being reset or to time, but it's very difficult to pin down exactly what triggers this. Any suggestions on how to narrow it down further?
Sorry, no, that's not what I was implying. What I meant is that if you're having problems, there's some kind of bug we need to track down. I was able to reproduce the issue by sleeping my machine, but we've not had the time to track it down in the code and fix it yet.