macOS Desktop Tunneler 2.34 (493) - Disconnects & Laggy

Hi,

This morning I received the update to version 2.34 of the macOS Desktop Tunneler, and it seems to introduce some issues:

  • Connections take quite a long time to establish. Often I get no IP from DNS or connection timeouts.
  • Established connections (SSH or Kubernetes API sessions) disconnect spontaneously.
  • When the connection stays up, it feels 'laggy' - typing in an SSH session feels like working over a high-latency line with sporadic packet loss (the session hangs for several seconds before the typed text appears).

I finally downgraded to 2.33 and now things work like a charm again. Controller and Router are on 0.31. Endpoints are behind Linux edge tunnelers, version 0.22.13.

Any hints or known issues?

Bye,
Chris

Hello and thanks for the report. Do you happen to have the appex and app logs from when 2.34 was running? I’ve not seen anything like this but I’ll take a closer look today.

Thanks,
-Shawn

Hi @scareything ,

Thank you for taking a look at it. I sent you a PM with the link to the logs!

Thank you!
Chris

Thanks for the logs. I've taken a first look and nothing is jumping out at me, at least nothing that makes sense with 2.33 working fine and 2.34 not working. I'll keep digging.

One thing I noticed is that one of your ziti services seems to be intercepting the DNS server that the host is configured to use. Is this intentional?

Sorry for the additional ask here, but would you be able to increase the log level to TRACE and DM the resulting logs from 2.34 again?

Hi @scareything,
Interesting discovery with the DNS server. Actually, it's not intentional - I added the network a few weeks ago for troubleshooting purposes, and the network I'm currently connected to has the DNS server in this IP range. I didn't notice this, and it also didn't cause any problems for weeks :wink:
Will check this out - the IP range forward shouldn't be needed anymore.

I just upgraded back to 2.34 and it worked fine for 15 minutes. I think this morning it was fine at first and the problems started after a while. I'll keep you posted.

Thanks,
Chris

Hi @scareything ,

I have been keeping an eye on it and I think I have an idea of when the problems occur: it seems to be related to when I use my IaC tool 'pulumi' in parallel. Pulumi opens quite a few sessions for all the targets it handles (all over OpenZiti in my scenario).
Yesterday I had two spontaneous SSH session disconnects - the first at 21:15 and the second at ~21:52.

Now that I have an idea how to reproduce it, I tried to build a test setup and trigger it. But it is not easy to reproduce, and at first I thought I wouldn't manage it at all. The collected data also includes a video of the session. In the video you can see 4 terminals (at least from minute 3 on):

  • The top one is a k9s session showing some Kubernetes pods being created and deleted by pulumi.
  • The next terminal shows a tcpdump of the ziti interface.
  • The 3rd is the pulumi output.
  • The bottom terminal shows an SSH session, in this case to the same box that hosts Kubernetes.

I tried several loops and it worked fine. But then at ~13:12 (video playtime) - I had just typed some text into the terminal - the session froze. At 13:25 the SSH session terminated and pulumi also exited with a read error.

At this point it appears that pulumi was uploading its 'state' data to the s3 storage - so it was doing a bulk upload. The s3 upload makes me think that this might be related in some way to the issue plorenz is working on: https://openziti.discourse.group/t/how-to-track-down-throughput-problems-on-the-ziti-router/ . But still - I don't have such disconnects with the 2.33 version. To correlate the logs with the video: take the timestamp from the tcpdump output and offset it by -1 hour.
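The -1 hour offset can be applied mechanically. A minimal sketch, assuming the tcpdump output uses its default `HH:MM:SS.ffffff` timestamp format; the function name and the sample value are only illustrative:

```python
from datetime import datetime, timedelta

def tcpdump_to_log_time(stamp, offset_hours=-1):
    """Shift a tcpdump wall-clock timestamp by offset_hours (default -1)."""
    # A placeholder date is prepended so the subtraction can wrap past midnight.
    t = datetime.strptime("2000-01-02 " + stamp, "%Y-%m-%d %H:%M:%S.%f")
    return (t + timedelta(hours=offset_hours)).strftime("%H:%M:%S.%f")

if __name__ == "__main__":
    # Hypothetical timestamp, not taken from the actual capture.
    print(tcpdump_to_log_time("14:12:05.123456"))
```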

I hope you can get an idea of what's going on from the logs.

Thanks & Bye,
Chris

Thanks for getting all of this info together. I agree that whatever you are seeing here is specific to the 2.34 release. I recently explained your issue to @ekoby, who does a lot of the ziti SDK and underlying crypto API work, and he thinks there may be an issue in the TLS library that made its way into ZDE 2.34. A fix for this issue is in place, but not yet released. I'll keep you informed as we move this fix through the chain of dependencies that go into ZDE, and in the meantime I'll be looking closely at the updated logs that you just sent.

Thanks,
-Shawn

Hi,
Thanks for the update! I assume tlsuv is also used by the Linux tunneler? So I should hold off on updating the Linux tunnelers until this fix makes it into the next release... :wink:

Thanks again,
Chris

Indeed, the latest version of ziti-edge-tunnel uses the same SDKs/versions as ZDE. I don't know if your ZETs experience the same load and circumstances as your ZDE client that exhibits the issue, but it would be wise to delay updating the ZET clients to 0.22.15 until we have this sorted out.

Hi,

I'm noticing another problem: sporadically, DNS resolution fails for individual hosts, while other hosts work. So it doesn't seem to be a general problem. After some time (minutes) the problem disappears.

I just had it and tried to track it down.

scutil --dns
resolver #1
  ...
  search domain[80] : pub-kube-dev.<omitted>
  ....
  nameserver[0] : 192.168.x.y
  if_index : 15 (en0)
  flags    : Request A records
  reach    : 0x00000002 (Reachable)

resolver #262
  domain   : pub-kube-dev.<omitted>
  nameserver[0] : 100.64.0.2
  if_index : 23 (utun5)
  flags    : Supplemental, Request A records
  reach    : 0x00000003 (Reachable,Transient Connection)
  order    : 100681

host pub-kube-dev.<omitted> 100.64.0.2
<timeout>

I have the tunneler logging set to debug, but there don't seem to be any DNS logs on it. Are there any known issues? How can I investigate further if this happens again?
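One way to narrow this down when it happens again is to send single queries straight to the tunneler's resolver and note which ones time out. Below is a minimal sketch using only the Python standard library. The helper names, the 3-second timeout and the choice of A and MX record types are assumptions for illustration (`host` issues an MX query in addition to the address lookups); the hostname in the usage line is a placeholder.

```python
import socket
import struct

QTYPES = {"A": 1, "MX": 15}  # standard DNS record type codes

def build_query(name, qtype="A", query_id=0x1234):
    """Build a minimal DNS query packet with a single question."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    return header + qname + struct.pack("!HH", QTYPES[qtype], 1)  # class IN

def probe(name, server, qtype="A", timeout=3.0):
    """Return True if the server sends any reply in time, False on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_query(name, qtype), (server, 53))
        try:
            s.recvfrom(512)
            return True
        except socket.timeout:
            return False

if __name__ == "__main__":
    for qtype in ("A", "MX"):
        # Replace the placeholder name with the host that fails to resolve.
        print(qtype, probe("host.example.invalid", "100.64.0.2", qtype))
```

If A queries are answered while MX queries time out, the failure is probably on the proxied path rather than in the local record lookup.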

Bye,
Chris

Are you running 2.34 when this happens?

edit:

And to take a stab at answering your question - you should see some DEBUG messages in the log for DNS activity. For example:

(94336)[     4866.871]   DEBUG tunnel-sdk:tunnel_udp.c:260 recv_udp() intercepted address[udp:100.64.1.2:53] client[udp:100.64.1.1:49310] service[ziti:dns-resolver]
(94336)[     4866.871]   DEBUG tunnel-cbs:ziti_dns.c:238 on_dns_client() new DNS client
(94336)[     4866.871]   DEBUG tunnel-sdk:ziti_tunnel.c:221 ziti_tunneler_dial_completed() ziti dial succeeded: client[udp:100.64.1.1:49310] service[ziti:dns-resolver]
(94336)[     4866.871]    INFO tunnel-cbs:ziti_dns.c:500 format_resp() found record[100.64.1.3] for query[1:zet.fedora-39-vm]
(94336)[     4866.871]   DEBUG tunnel-sdk:ziti_tunnel.c:435 ziti_tunneler_close() closing connection: client[udp:100.64.1.1:49310] service[ziti:dns-resolver]
(94336)[     4866.871]   DEBUG tunnel-sdk:tunnel_udp.c:119 tunneler_udp_close() closing ziti:dns-resolver session

You might see some different messages when using the host command - it sends MX queries, which the ziti DNS server proxies to the ziti endpoint that's hosting the service that's associated with the queried domain. In that case you'll see something like this:

(94336)[     4799.409]    INFO tunnel-cbs:ziti_dns.c:641 proxy_domain_req() writing proxy resolve [{
	"status":0,
	"id":55494,
	"recursive":0,
	"question":[{
		"name":"zet.fedora-39-vm",
		"type":15
		}]
	}]

Either way, you should see an intercepted address[udp:100.64.0.2:53] message (at DEBUG) for every DNS packet that ZDE intercepts. If you aren't seeing those then the packet tunnel provider likely isn't reading the packets, possibly because the libuv loop (which is the sole context of execution for ZDE) is occupied in a long-running function - most likely because something has gone awry.
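To check a log for these messages, something like the following could help. It is a rough sketch: the regular expression is modeled only on the sample lines above, so the exact layout may differ between versions, and the function names are made up for illustration. The timestamps are treated as seconds, as they appear in the samples.

```python
import re

# Modeled on the sample recv_udp() line above; adjust if the format differs.
INTERCEPT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+DEBUG\s+\S+\s+recv_udp\(\)\s+"
    r"intercepted address\[udp:(?P<dst>[\d.]+):53\]\s+"
    r"client\[udp:(?P<client>[\d.]+:\d+)\]"
)

def dns_intercepts(lines):
    """Return (timestamp, resolver_ip, client) for each intercepted DNS packet."""
    hits = []
    for line in lines:
        m = INTERCEPT_RE.search(line)
        if m:
            hits.append((float(m.group("ts")), m.group("dst"), m.group("client")))
    return hits

def longest_gap(hits):
    """Largest gap in seconds between consecutive intercepts (0.0 if fewer than two)."""
    times = [h[0] for h in hits]
    return max((b - a for a, b in zip(times, times[1:])), default=0.0)
```

Running `dns_intercepts(open(path))` over a saved log (`path` being wherever the log was exported to) and comparing the timestamps with the moment a lookup timed out shows whether the query was intercepted at all. A long gap during a period in which lookups were attempted would be consistent with the loop being blocked.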

-Shawn

Busted - I seem to have downgraded it again. It happened while 2.33 was running. And it seems that 2.33 doesn't log any DNS requests.
Okay, good to know. I'll keep an eye on this when I'm on 2.34 or later.
Sorry for the confusion.

/Chris

Actually I don't blame you at all for downgrading. 2.34 is a bit flakey especially with any load, and actually I was hoping this DNS issue might be related to all of that. As you noticed, 2.33 doesn't initialize the SDK loggers so there isn't much to see in those logs.

I just submitted updates for the desktop and mobile tunnelers to the Apple App Stores. It can take anywhere from a couple of hours to a couple of days for the App Store review to complete - look for version 2.35 in the App Store. I didn't bother putting this version up in TestFlight because we're pretty confident in the fix.

Thank you for everything! It's available and the update is installed. I'll keep an eye on it.

Thanks a lot!
/Chris