Tunneler stops working

I have an application running in my network, and I set up the tunneler as a “host” for this application:

ziti edge create config app.host host.v1 '{"protocol":"tcp", "address":"app.com", "port":443}'
ziti edge create config app.intercept intercept.v1 '{"protocols":["tcp"], "addresses":["app.com"], "portRanges":[{"low":443, "high":443}]}'

ziti edge create service app.service --configs "app.host,app.intercept"

ziti edge create service-policy app.bind Bind --service-roles '@app.service' --identity-roles '#server'

ziti edge create service-policy app.dial Dial --service-roles '@app.service' --identity-roles '#client' 
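One way to catch quoting mistakes in the inline JSON bodies above before handing them to ziti is to run them through a JSON validator first (a sketch; python3 ships with Ubuntu):

```shell
# Validate the host.v1 config JSON before passing it to ziti; a shell
# quoting mistake shows up here as a parse error instead of a confusing
# failure later.
cfg='{"protocol":"tcp", "address":"app.com", "port":443}'
echo "$cfg" | python3 -m json.tool > /dev/null && echo "config JSON is valid"
```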

It worked fine for a few days and then it stopped working. It came back up once I rebooted the tunneler.
Then, it happened again a few days later.

The tunneler is running on an Ubuntu server on AWS, and these are the logs from when the issue happened:

Feb 17 12:11:45 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   773933.759]    WARN tunnel-cbs:ziti_hosting.c:250 on_hosted_ziti_write() ziti write error: -19(Operation did not complete in time), stop reading peer
Feb 17 12:12:25 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   773973.107]    WARN tunnel-cbs:ziti_hosting.c:250 on_hosted_ziti_write() ziti write error: -19(Operation did not complete in time), stop reading peer
Feb 17 12:12:28 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   773976.451]    WARN ziti-sdk:connect.c:1501 process_edge_message() conn[0.14843/Disconnected] data[59 bytes] received in state[Disconnected]
Feb 17 12:12:28 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   773976.451]    WARN ziti-sdk:connect.c:1501 process_edge_message() conn[0.14843/Disconnected] data[59 bytes] received in state[Disconnected]
Feb 17 12:12:28 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   773976.452]    WARN ziti-sdk:connect.c:1501 process_edge_message() conn[0.14843/Disconnected] data[75 bytes] received in state[Disconnected]
Feb 17 12:17:01 openziti-tunneler CRON[139732]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Feb 17 12:23:20 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   774628.248]   ERROR tunnel-cbs:ziti_hosting.c:596 on_hosted_client_connect() hosted_service[app.service], client[client1]: getaddrinfo(app-prod.organizationinternational.com,443) failed: System error
Feb 17 12:23:21 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   774629.645]   ERROR tunnel-cbs:ziti_hosting.c:596 on_hosted_client_connect() hosted_service[app.service], client[client1]: getaddrinfo(app-prod.organizationinternational.com,443) failed: System error
Feb 17 12:23:23 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   774630.978]   ERROR tunnel-cbs:ziti_hosting.c:596 on_hosted_client_connect() hosted_service[app.service], client[client2]: getaddrinfo(app-prod.organizationinternational.com,443) failed: System error
Feb 17 12:23:23 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   774631.005]   ERROR tunnel-cbs:ziti_hosting.c:596 on_hosted_client_connect() hosted_service[app.service], client[client2]: getaddrinfo(app-prod.organizationinternational.com,443) failed: System error
Feb 17 12:23:23 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   774631.705]   ERROR ziti-sdk:ziti_ctrl.c:154 ctrl_resp_cb() ctrl[openziti-controller.organization.systems] request failed: -24(too many open files)
Feb 17 12:23:23 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   774631.705]   ERROR ziti-sdk:ziti_ctrl.c:154 ctrl_resp_cb() ctrl[openziti-controller.organization.systems] request failed: -24(too many open files)
Feb 17 12:23:23 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   774631.705]   ERROR ziti-sdk:ziti.c:1052 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/too many open files] from ctrl[https://openziti-controller.organization.systems:8441]
Feb 17 12:23:23 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   774631.705]    WARN tunnel-cbs:ziti_tunnel_ctrl.c:739 on_ziti_event() ziti_ctx controller connections failed: Ziti Controller is not available
Feb 17 12:23:23 openziti-tunneler ziti-edge-tunnel[20547]: (20547)[   774631.705]   ERROR ziti-edge-tunnel:ziti-edge-tunnel.c:1192 on_event() ztx[/opt/openziti/etc/identities/Tunneler.json] failed to connect to controller due to Ziti Controller is not available



A few questions:

  • What version of ziti-edge-tunnel are you running?
  • When it gets into that state, can you execute lsof -p <pid of ziti-edge-tunnel process>?
  • Check your limits with ulimit -a

Thanks for the quick response.

Environment

OS: Linux
Arch: x86_64
Release: 5.15.0-1030-aws
Version: #34-Ubuntu SMP Mon Jan 23 20:13:32 UTC 2023

Sdk

Version: 0.30.9
Id: ziti-edge-tunnel
App Version: v0.20.18-local
Type: ziti-sdk-c
Branch: HEAD
Revision: 743d7f8


About the limits, I get this result but I am not sure what I need to look for:

$ ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 15630
max locked memory           (kbytes, -l) 501884
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 15630
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

@ekoby Hey it happened again and I think I found the root cause thanks to your command:

lsof -p <pid of ziti-edge-tunnel process>

It looks like it tries to open more than 1024 files:

ubuntu@openziti-tunneler:~$ sudo lsof -p 506
COMMAND   PID USER   FD      TYPE             DEVICE SIZE/OFF   NODE NAME
...

ziti-edge 506 root  905u     IPv4             555740      0t0    TCP ip-*-*-*-*.eu-west-1.compute.internal:60822->ec2-*-*-*-*.eu-west-1.compute.amazonaws.com:https (CLOSE_WAIT)
ziti-edge 506 root  906u     IPv4             556348      0t0    TCP ip-*-*-*-*.eu-west-1.compute.internal:60834->ec2-*-*-*-*.eu-west-1.compute.amazonaws.com:https (CLOSE_WAIT)
ziti-edge 506 root  907u     IPv4             558461      0t0    TCP ip-*-*-*-*.eu-west-1.compute.internal:60578->ec2-*-*-*-*.eu-west-1.compute.amazonaws.com:https (CLOSE_WAIT)
ziti-edge 506 root  908u     IPv4             556082      0t0    TCP ip-*-*-*-*.eu-west-1.compute.internal:60828->ec2-*-*-*-*.eu-west-1.compute.amazonaws.com:https (CLOSE_WAIT)
ziti-edge 506 root  909u     IPv4             556086      0t0    TCP ip-*-*-*-*.eu-west-1.compute.internal:60836->ec2-*-*-*-*.eu-west-1.compute.amazonaws.com:https (CLOSE_WAIT)
ziti-edge 506 root  910u     IPv4             556096      0t0    TCP ip-*-*-*-*.eu-west-1.compute.internal:53344->ec2-*-*-*-*.eu-west-1.compute.amazonaws.com:https (CLOSE_WAIT)

I have 1023 of these lines. Once it reaches 1024, the errors start.
Any idea why this is happening?

Thanks!!

After doing some more troubleshooting, it looks like OpenZiti is not closing connections, just leaving them in CLOSE_WAIT status.

For example, I see CLOSE_WAIT connections that are hours old.

Is it possible to change something on the tunneler side to make it close connections that are stuck in CLOSE_WAIT?
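In case it's useful, here's how I'm counting the stuck sockets without lsof: in /proc/net/tcp the fourth column is the connection state in hex, and 08 means CLOSE_WAIT (a sketch; /proc/net/tcp6 covers IPv6 sockets):

```shell
# Count sockets in CLOSE_WAIT system-wide; state code 08 in the fourth
# column of /proc/net/tcp means CLOSE_WAIT.
close_wait=$(cat /proc/net/tcp /proc/net/tcp6 2>/dev/null | awk '$4 == "08"' | wc -l)
echo "CLOSE_WAIT sockets: $close_wait"
```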

Thank you for providing additional information.
I’ve noticed that you’re a couple releases behind.
Our latest release has some refactoring of the code that handles those hosting side connections.
Could you download and try the latest release?
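In the meantime, if you need a stopgap: the open files (-n) 1024 value in your ulimit output is the ceiling the tunneler is hitting. Raising it only hides the symptom (it doesn't fix the leak), but it can buy time between restarts. A sketch using a systemd drop-in, assuming the unit is named ziti-edge-tunnel.service — adjust for your install:

```shell
# Hypothetical systemd drop-in raising the open-files limit for the service;
# the unit name is an assumption about this install.
sudo mkdir -p /etc/systemd/system/ziti-edge-tunnel.service.d
sudo tee /etc/systemd/system/ziti-edge-tunnel.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65535
EOF
sudo systemctl daemon-reload
sudo systemctl restart ziti-edge-tunnel.service
```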

I’ll do that tonight and let you know how it went. Thanks!

@ekoby Unfortunately the issue is still happening after the upgrade:

Environment

OS: Linux
Arch: x86_64
Release: 5.15.0-1030-aws
Version: #34-Ubuntu SMP Mon Jan 23 20:13:32 UTC 2023

Sdk

Version: 0.31.2
Id: ziti-edge-tunnel
App Version: v0.20.21-local
Type: ziti-sdk-c
Branch: HEAD
Revision: c74ab09


There are hundreds of sessions in “CLOSE_WAIT” status that never go away.

Hi @Foles_90. @ekoby is out for a few days, so I’ll try to help keep digging on this issue because it is very important to us. Thanks for your patience and your help!

This may be old news, but I’ll just summarize what I’m taking away so far: the connections that are accumulating are with the ziti controller, and the fact that they are in CLOSE_WAIT suggests that the controller is initiating the close (sending a TCP FIN) and ziti-edge-tunnel, for some reason, is not closing its side of the connection (sending its own FIN) after the controller’s FIN is received.

I’m not sure why ziti-edge-tunnel isn’t able to close its side of the connections in your environment. I tried several scenarios here, and so far I’ve been unable to get connections stuck in CLOSE_WAIT. In fact, I have a hard time even seeing the connections while they are in the ESTABLISHED state.

It would be interesting to see the connection states on your ziti controller’s host (or the proxy/LB that it’s behind) if that’s possible. I suspect we’d find a similar number of connections in the TIME_WAIT state there.

Also, I wonder if there’s an early sign of the connection issues with the controller if you go back to earlier time periods in your log. If you could share a ziti-edge-tunnel log that shows a complete lifecycle, that would be helpful. I know you’re prudent with sharing your logs here, so feel free to trim/sanitize it, especially if you see something that looks like it could be related to controller connections. You could also send the log directly to me if you’re concerned about putting things out on the Internet for eternity. Whatever you’re comfortable with.

Thanks again!

Hi @scareything , sorry for the late reply, I was away for a few days.

Here is a diagram of how everything is now:

The connections that are accumulating are between the tunneler and the WebServer.

What logs do you think will be useful?

The logs from ziti-edge-tunnel with more context might be helpful. Preferably a log that shows the process starting and the first connection error. Thanks.

I shared the logs over a DM.

Thanks!

Thanks for the logs; I got them and saw some interesting/unexpected things. Could you please produce logs at VERBOSE level (by running ziti-edge-tunnel with -v5)?

btw we’ll be releasing ziti-edge-tunnel 0.21.0 shortly, which includes a dependency update that may fix the file descriptor leak that you’re seeing. Could you please try with 0.21.0 and share VERBOSE logs if you’re still seeing the leak? Thank you!