Hi @scareything ,
I have been keeping an eye on it and I think I have an idea of when the problems occur: they seem to happen when I run my IaC tool 'pulumi' in parallel. Pulumi opens quite a few sessions for all the targets it handles (all over OpenZiti in my scenario).
Yesterday I had two spontaneous SSH session disconnects - the first at 21:15 and the second at ~21:52.
Now that I have an idea of how to trigger it, I built a test setup to try. It is not easy to reproduce, though, and at first I thought I wouldn't manage it. The collected data also includes a video of the session. In the video you can see 4 terminals (at least from minute 3 on):
- The top one is a k9s session showing some Kubernetes pods being created and deleted by pulumi.
- The next terminal shows a tcpdump of the ziti interface
- The 3rd is the pulumi output
- The bottom terminal shows an SSH session, in this case to the same box that hosts Kubernetes.
I ran several loops and everything worked fine. But then at ~13:12 (video playtime), while I was just typing some text into the terminal, the session froze. At 13:25 the SSH session terminated, and pulumi also exited with a read error.
At this point pulumi appears to have been uploading its 'state' data to the S3 storage, so it was doing a bulk upload. The S3 upload makes me think this might be related in some way to the issue plorenz is looking into: https://openziti.discourse.group/t/how-to-track-down-throughput-problems-on-the-ziti-router/ . But still, I don't have such disconnects with the 2.33 version. To correlate the logs with the video: take the timestamp from the tcpdump output and offset it by -1 hour.
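In case it helps with the correlation, here is a minimal sketch of that -1 hour offset. It assumes the tcpdump timestamps are in the default `HH:MM:SS.ffffff` form; the function name and the example value are only illustrative and not taken from the collected data.

```python
from datetime import datetime, timedelta


def tcpdump_to_log_time(ts: str, offset_hours: int = -1) -> str:
    """Shift a tcpdump-style timestamp (assumed HH:MM:SS.ffffff) by offset_hours."""
    parsed = datetime.strptime(ts, "%H:%M:%S.%f")
    shifted = parsed + timedelta(hours=offset_hours)
    # Format by hand so only the time of day is returned, even when the
    # shift crosses midnight.
    return (
        f"{shifted.hour:02d}:{shifted.minute:02d}:"
        f"{shifted.second:02d}.{shifted.microsecond:06d}"
    )


if __name__ == "__main__":
    # Hypothetical timestamp as it might appear in the tcpdump terminal.
    print(tcpdump_to_log_time("21:15:03.123456"))
```

The result is the time of day to look for in the logs; the date is not part of the tcpdump timestamp in this form, so it has to be taken from context.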
I hope you can get an idea of what's going on from the logs.
Thanks & Bye,
Chris