SSH over ziti tunnel hangs often

Howdy folks - I’ve been a programmer and general admin for a long time, including network programming for parts of it, but I’m new to OpenZiti and zero-trust in general, so bear with me if I get some of the terminology slightly wrong (or very wrong).

I have the following set up, all using the Open Source stuff (I’m not using NetFoundry paid anything). I set up the first part (db service tunneling) under the guidance of an OpenZiti expert, then I added in the next part (ssh tunneling) myself.

  • macOS Ziti Desktop Edge 2.24 (458) (2 machines)
  • 2 device identities, 1 user identity (default admin), 1 router identity
  • one edge router, installed on a DigitalOcean “cpu optimized” 2 vCPU / 4 GB RAM machine which also runs a few other small services, all “non-production” stuff not getting much traffic
  • controller v0.27.2 and ZAC 2.5.1 running on the same DO machine
  • 3 Services: a DB on another machine on the DO private network, SSH on another machine on the DO private network, and SSH on the “localhost” of the edge router machine
  • “magic” DNS so I ssh and psql to a ziti-only hostname on my macOS machines

I use all that to connect from my laptop (one of the two device identity machines) to the database on the private network (postgres), and to the two SSH servers (one of which is also available on the public internet, but I go through the tunnel).

Here’s what I’ve run into: Frequently (but NOT ALWAYS) the SSH sessions will hang. No more text will appear on my screen despite any keystrokes. Control-C does not exit out. These “feel like” the situations where the SSH client would normally, eventually, tell you the server connection was severed and exit, but when going over the Ziti tunnel they just hang forever (or at least 24 hours in one instance). This SEEMS TO happen when “pushing a lot of data” across it (like rsync, or big builds outputting a lot of text). And SOMETIMES I will see it spit a few thousand characters out, pause for a moment, spit out another few thousand, then go back to normal (or hang indefinitely). ONCE I was able to get it to do some “extra spitting out” by sending some keystrokes, almost as if it was waiting for data to be transmitted before it would receive more. But that was just one time :shrug:.

When this happens, I can immediately open up a new ssh-over-ziti and start over what I was doing, which works fine until that one hangs. OR, I can immediately SSH over the public internet (to the one with that open), and everything works fine through that indefinitely, including overnight just sitting there.

During these occurrences, I do not get any “disconnect / reconnect” macOS notification from Ziti Desktop Edge, as I do on very rare occasions, e.g. when Spectrum (my cable internet provider) goes down for a moment.

Regarding the postgres connection, neither I nor the other person who uses the postgres tunnel has had any complaints (she doesn’t use the SSH tunnel).

I have not restarted the DO server, or any of the OpenZiti services running on it, since setting it up a couple weeks ago. I’d prefer not to do this until explicitly directed to as part of a troubleshooting process, because in my opinion it’s not really acceptable to “just restart it once in a while”, at least not until I find out that’s the only option.

Ring any bells? Where do I start troubleshooting this? I’ve done cursory looks for log files, admittedly not exhaustively, and not come up with anything.

Thanks y’all for your time and help!

Jason

Howdy @woodwardjd, welcome to the community and thanks for the post. I’ve not seen any reports of tunnelers “hanging” like this but I, myself, get this exact behavior on AWS T2.micro sized VMs when I’ve done a lot of CPU/IO. It has always felt to me like the cloud provider was basically putting my VM into “timeout” until it earns back enough credits. At that point it recovers and things are fine. Could it be possible that you’re hitting some kind of cloud-provider limit and getting throttled? From your description, it almost sounds like it’s not recovering properly. I’ve not seen that behavior and we’d definitely need to fix that if that’s the case.

Regardless, looking at the logs would probably be useful. It would be interesting to check the Ziti Desktop Edge for Mac logs to see if there are any interesting/useful/helpful messages. It’d also be worth looking at the logs from the router (and possibly the controller, but usually it’s the router/tunneler logs that are most useful). If you’re ok with providing those logs, you can get them from the router using something like journalctl (if you run the router with systemd), and on the Mac they can be found via the UI. You could send them to help at openziti.org if you’re ok with that, and we could look at them too.
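For the router side, here’s a minimal sketch of capturing recent logs with journalctl. The unit name `ziti-router` is an assumption (installs vary); check `systemctl list-units | grep -i ziti` for yours:

```shell
# Hypothetical unit name -- substitute whatever your install actually uses
UNIT=ziti-router
# Capture the last 24 hours of router logs into a file you can attach/email
journalctl -u "$UNIT" --since "24 hours ago" --no-pager > router.log 2>/dev/null || true
ls -l router.log
```

Adding `--since`/`--until` around a known failure window keeps the file small enough to share.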

hahah. Well if it’s necessary to fix it ‘for now’, perhaps, but only to figure out if there’s a bug. Long-term, that’s definitely not acceptable, we all will agree with you there! :slight_smile:

Let’s peek at the logs. I’ll poke some people to check this thread out, and we’ll go from there.

Thanks! I’ll get those logs. For ZDEMac, I don’t see anything in the UI for getting logs. It doesn’t have any menu options besides about and quit, and there isn’t a log thing in the main panel or in the gear-invoked Config dialog. I googled and couldn’t find a web page for it, only for the Windows ZDE. I do see some stuff in macOS Console, filtering for “ziti”, if that’s what I should be looking for?

Here’s a screen cap of the logs on macOS

[screenshot: macOS Console filtered for “ziti”]

There is a menu under the lightning-bolt icon in the menu bar. Logging → Packet Tunnel will be the more interesting log. You can also find the logs in ~/Library/Group Containers/MN5S649TXM.ZitiPacketTunnel.group/logs
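For reference, a quick terminal sketch for locating the newest Packet Tunnel logs, using the path mentioned above:

```shell
# Log directory used by Ziti Desktop Edge for Mac (path from the post above)
LOGDIR="$HOME/Library/Group Containers/MN5S649TXM.ZitiPacketTunnel.group/logs"
echo "$LOGDIR"
# Show the most recently written log files, if the directory exists
ls -t "$LOGDIR" 2>/dev/null | head -5 || true
```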

Oh lol I hadn’t noticed that menu, just the app’s menu! Thank you.

That’s a little juicier (Packet Tunnel logs on macOS)! There’s “attempted write in invalid state” and “received message without conn_id or for unknown connection” and “channel.c:816 channel_alloc_cb() can’t alloc message” and others. I can send stuff from last night, when I can’t corroborate exact failure times, or wait until it happens again and give you a time window. Which would be preferable? The latter? or Both? or doesn’t matter?

If you’re fine with just sending the logs, send them over and we’ll take a look, see if anything pops out at us, and go from there. It’ll be more interesting to try to figure out “what happened, how, and why”… :slight_smile:


I too have observed the SSH hung condition, but I haven’t gotten around to properly reproducing it with debug logs yet. Maybe now’s the time. :grimacing:

It’s fairly repeatable, but I’m not sure what causes it. The symptom is that a wall of text in the terminal I’m using over SSH (e.g., tailing a verbose log) will cause the terminal to become unresponsive. After a few minutes SSH times out and drops the session, and I can reconnect. I’m not even sure whether the tunneler on either end was bounced automatically, but I don’t normally have to do anything to resurrect the tunneler itself, just reconnect SSH after it finally drops the session.
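As an aside: this doesn’t address the underlying tunnel issue, but client-side keepalives can at least turn an indefinite hang into a detected disconnect within a bounded time. A sketch of the relevant ~/.ssh/config options (the host alias and the specific values are illustrative, not a recommendation from this thread):

```
# ~/.ssh/config -- illustrative values only
Host my-ziti-host           # hypothetical Host alias for the ziti-only hostname
    ServerAliveInterval 15  # probe the server every 15s over the encrypted channel
    ServerAliveCountMax 3   # give up (and exit) after 3 unanswered probes, ~45s
```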

These messages are generally benign:

  • can’t alloc message: upstream is being throttled because the client isn’t reading fast enough
  • received message without conn_id or for unknown connection: usually means the client has disconnected but data is still arriving for it

I just experienced the failure again. This session had been open for a long time (>24 hours) and failed during a lots-of-data-pushed-across situation. The data was the output of an rsync -av, which lists files transferred; the rsync was copying files on the remote machine to itself, not moving files across the ssh session.

NEW THIS TIME: I had a concurrent ssh already open (not over the ziti network, unfortunately) and checked whether the rsync process was still running or the other user was still logged in. Moments after the failure, the rsync was no longer running and the user was no longer logged in. On the client side it is frozen, which I imagine could be any number of things (ssh client configuration, for instance). I can find the ssh process in ps, and it terminates when I kill (default TERM) it.

Thanks so much for the information. @ekoby has made a change today that we think “might” help/fix this particular problem. We haven’t had a chance to test it out yet, but we’ll keep you informed and follow up in a day or so.


Just following up here. Our latest tests didn’t fix the problem. We’re still trying to reproduce exactly and find/fix the issue.

Thanks for the followup! I want to see if my uneducated guess is correct so I’m going to write it here and then we’ll see! The bug seems to be triggered by the “wall of text” experience (but not all wall-o-texts). I’ve found many times (all?) that if I “spidey sense” it happening soon enough, I can press enter (send some carriage returns) several times in quick succession, and with each it’ll spit out another block of text (not big enough to be 1024 bytes, maybe 100s of bytes or so). So I’m betting there’s some sort of buffer-send that’s only triggered on data-receive. Normally it doesn’t require this data-receive to send data when available, but it somehow gets into a state where it does.

:point_up: the internet version of the “sealed envelope” lol

Thanks, that’s the sort of useful comment that might lead to an ‘ah-ha’ moment. Appreciate that observation. If anything else strikes you, please feel free to share, even if you think “it can’t be ${this}, could it???” :slight_smile: I think we both know sometimes it is ${this}! :slight_smile:

Could this be the problem (MTU)? SSH Frequently Asked Questions
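If MTU is the suspect, one quick way to probe is a don’t-fragment ping against the ziti-only hostname. A rough sketch (the hostname is a placeholder; note the DF flag is `-D` on macOS but `-M do` on Linux):

```shell
# A full 1500-byte IP packet leaves 1500 - 20 (IP header) - 8 (ICMP header) bytes of payload
MTU=1500
PAYLOAD=$((MTU - 28))
echo "payload=$PAYLOAD"
# Probe with fragmentation disallowed (commented out: needs a live host):
#   macOS:  ping -D -s $PAYLOAD -c 3 my-ziti-hostname
#   Linux:  ping -M do -s $PAYLOAD -c 3 my-ziti-hostname
# If full-size probes fail but smaller payloads succeed, the path MTU is below 1500
# and large SSH writes could be getting black-holed.
```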