This morning I received an update to version 2.34 of the macOS Desktop Tunneler, and it seems to introduce some issues:
Connections take quite a long time to establish. Often I get no IP from DNS or connection timeouts.
Established connections (SSH or Kubernetes API sessions) get spontaneous disconnects.
When the connection stays up, it feels 'laggy' - typing in an SSH session feels like working over a high-latency line with sporadic packet loss (the session hangs for several seconds before the typing appears).
I finally downgraded to 2.33 and now things work like a charm again. Controller and Router are on 0.31. Endpoints are behind Linux edge tunnelers version 0.22.13.
Hello and thanks for the report. Do you happen to have the appex and app logs from when 2.34 was running? I’ve not seen anything like this but I’ll take a closer look today.
Thanks for the logs. I've taken a first look and nothing is jumping out at me, at least nothing that makes sense with 2.33 working fine and 2.34 not working. I'll keep digging.
One thing I noticed is that one of your ziti services seems to be intercepting the DNS server that the host is configured to use. Is this intentional?
Hi @scareything,
Interesting discovery with the DNS server. Actually, it's not intentional - I added the network a few weeks ago for troubleshooting purposes, and the network I'm currently connected to has its DNS server in this IP range. I didn't notice this, and it didn't cause any problems for weeks.
Will check this out - the IP range forward shouldn't be needed anymore.
I just upgraded back to 2.34 and it worked fine for 15 minutes. I think this morning it was fine at first and the problems started after a while. I'll keep you posted.
I have been keeping an eye on it and I think I have an idea of when the problems occur: it seems to be related to when I use my IaC tool 'pulumi' in parallel. Pulumi opens quite a few sessions for all the targets it handles (all over OpenZiti in my scenario).
Yesterday I had two spontaneous SSH session disconnects - the first at 21:15 and the second at ~21:52.
Now that I have an idea of how to reproduce it, I tried to build a test setup and trigger it. It is not easy to reproduce, though, and at one point I thought I wouldn't manage it at all. The collected data also includes a video of the session. In the video you can see 4 terminals (at least from minute 3 on):
The top one is a k9s session showing some Kubernetes pods being created and deleted by pulumi.
The next terminal shows a tcpdump of the ziti interface.
The 3rd is the pulumi output.
The bottom terminal shows an SSH session, in this case to the same box that hosts the Kubernetes cluster.
I tried several loops and it worked fine. But then at ~13:12 (video playtime), just as I typed some text into the terminal, the session froze. At 13:25 the SSH session terminated and pulumi also exited with a read error.
At this point it appears that pulumi was uploading its 'state' data to the s3 storage - so it was doing a bulk upload. The s3 upload makes me think that this might be related in some way to the issue plorenz is working on: https://openziti.discourse.group/t/how-to-track-down-throughput-problems-on-the-ziti-router/ . But still - I don't have such disconnects with the 2.33 version. To correlate the logs with the video: take the timestamp from the tcpdump output and offset it by -1 hour.
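For anyone lining up the tcpdump output with the logs, the -1 hour offset can be applied with a small helper. This is only a sketch: it assumes tcpdump printed its default time-of-day stamps (HH:MM:SS.ffffff), and the function name and sample value are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: tcpdump was run with its default timestamp style (HH:MM:SS.ffffff).
TCPDUMP_FORMAT = "%H:%M:%S.%f"


def to_log_time(tcpdump_ts: str, offset_hours: int = -1) -> str:
    """Shift a tcpdump time-of-day stamp by offset_hours (default -1 hour)."""
    parsed = datetime.strptime(tcpdump_ts, TCPDUMP_FORMAT)
    shifted = parsed + timedelta(hours=offset_hours)
    # Only the time of day is returned; the date is not part of the stamp.
    return shifted.strftime(TCPDUMP_FORMAT)


if __name__ == "__main__":
    # Hypothetical sample value, not taken from the actual capture.
    print(to_log_time("14:12:03.123456"))
```

The result is the time of day to search for in the tunneler logs.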
I hope you can get an idea of what's going on from the logs.
Thanks for getting all of this info together. I agree that whatever you are seeing here is specific to the 2.34 release. I recently explained your issue to @ekoby, who does a lot of the ziti SDK and underlying crypto API work, and he thinks there may be an issue in the TLS library that made its way into ZDE 2.34. A fix for this issue is in place, but not yet released. I'll keep you informed as we move this fix through the chain of dependencies that go into ZDE, and in the meantime I'll be looking closely at the updated logs that you just sent.
Hi,
Thanks for the update! I assume the tlsuv is also used by the Linux tunneler? So I should wait with updating the Linux tunnelers until this fix makes it into the next release...
Indeed, the latest version of ziti-edge-tunnel uses the same SDKs/versions as ZDE. I don't know if your ZETs experience the same load and circumstances as your ZDE client that exhibits the issue, but it would be wise to delay updating the ZET clients to 0.22.15 until we have this sorted out.
I'm noticing another problem: sporadically, DNS resolution fails for individual hosts, while other hosts work. So it doesn't seem to be a general problem. After some time (minutes) the problem disappears.
I have the tunneler logging set to debug, but there don't seem to be any DNS logs on it. Are there any known issues? How can I investigate further if this happens again?
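One way to narrow down when and for which hosts resolution fails is to leave a small probe running and compare its timestamps with the tunneler log afterwards. The following is a rough sketch, not an official diagnostic: it just asks the system resolver via `socket.getaddrinfo`, and the `probe` and `watch` names are made up for illustration.

```python
import socket
import time
from datetime import datetime


def probe(hosts, resolver=socket.getaddrinfo):
    """Resolve each host once; return {host: error text, or None on success}."""
    results = {}
    for host in hosts:
        try:
            resolver(host, None)
            results[host] = None
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results[host] = str(exc)
    return results


def watch(hosts, interval_s=10.0, rounds=6):
    """Probe repeatedly and print a timestamped line for every failure."""
    for _ in range(rounds):
        stamp = datetime.now().isoformat(timespec="seconds")
        for host, err in probe(hosts).items():
            if err is not None:
                print(f"{stamp} FAIL {host}: {err}")
        time.sleep(interval_s)
```

Calling `watch(["service-a.example", "service-b.example"])` (placeholder hostnames) prints only the failures, so a quiet terminal means every lookup succeeded.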
You might see some different messages when using the host command - it sends MX queries, which the ziti DNS server proxies to the ziti endpoint that's hosting the service that's associated with the queried domain. In that case you'll see something like this:
Either way, you should see an intercepted address[udp:100.64.0.2:53] message (at DEBUG) for every DNS packet that ZDE intercepts. If you aren't seeing those then the packet tunnel provider likely isn't reading the packets, possibly because the libuv loop (which is the sole context of execution for ZDE) is occupied in a long-running function - most likely because something has gone awry.
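To check quickly whether those messages are present, a log file can be scanned for them. This is a minimal sketch: the pattern is built only from the message text quoted above, nothing else about the log line layout is assumed, and the log path is whatever you pass on the command line.

```python
import re
import sys
from collections import Counter

# Built from the message quoted above; the rest of the log line is not assumed.
INTERCEPT_RE = re.compile(r"intercepted address\[udp:(?P<ip>[0-9.]+):53\]")


def count_dns_intercepts(lines):
    """Count intercepted DNS packets per destination IP."""
    counts = Counter()
    for line in lines:
        match = INTERCEPT_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_intercepts.py <path-to-log-file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for ip, n in count_dns_intercepts(fh).items():
            print(f"{ip}: {n} intercepted DNS packets")
```

A count of zero over a period in which lookups were attempted would point towards the packets not being read at all.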
Busted - I seem to have downgraded again, so this happened while 2.33 was running. And it seems that 2.33 doesn't log any DNS requests.
Okay, good to know. I'll keep an eye on this when I'm on 2.34 or later.
Sorry for the confusion.
Actually I don't blame you at all for downgrading. 2.34 is a bit flaky, especially under any load, and I was hoping this DNS issue might be related to all of that. As you noticed, 2.33 doesn't initialize the SDK loggers, so there isn't much to see in those logs.
I just submitted updates for the desktop and mobile tunnelers to the Apple App Stores. It can take anywhere from a couple of hours to a couple of days for the App Store review to complete - look for version 2.35 in the App Store. I didn't bother putting this version up in TestFlight because we're pretty confident in the fix.