Connection refused when using a browser

Hi *,
I have a ziti-controller and a ziti-router running in a k8s cluster. The router acts as tunneler to cluster-internal services. I exposed the controller and the edge listener via ingress controller.
I can then start ziti-edge-tunnel on my local machine and connect to the ziti fabric. I exposed some
cluster internal web services on the ziti overlay: the hello toy service and grafana. I can then issue curl http://hello.ziti locally and get the website back. So this works. It's also quite stable, I can hit the web site in a for loop on the shell 1000 times and it will deliver without problems. This also works in the browser. I can also access grafana, at least when I expose the port directly, foregoing ziti, without a problem. I can access grafana.ziti using curl via ziti, as long as I request one URL at a time: I can run wget -r http://grafana.ziti in a loop without issues.

Once i use the browser to access grafana.ziti, I get confronted with stalled connections, which end up being reported as "connection refused" by the browser, for files that would normally deliver. Some connections go through and deliver parts of the web page, but this will quickly end. At that point the ziti fabric becomes completely unresponsive, even for the plain curl requests to hello.ziti before. After a while this "normalizes" to the previous state. I have not changed any rate limiting options and used the provided Helm charts for deployment. Here are some excerpts from the controller logs captured during the browser requests:

{"file":"github.com/openziti/ziti/controller/network/fault.go:32","func":"github.com/openziti/ziti/controller/network.(*Network).fault","level":"info","msg":"network fault processing for [1] circuits","time":"2024-10-24T00:15:39.540Z"}
{"circuitId":"K8Y2dhISh","file":"github.com/openziti/ziti/controller/network/fault.go:49","func":"github.com/openziti/ziti/controller/network.(*Network).fault","level":"info","msg":"sent unroute for circuit to router in response to forwarding fault","routerId":"Nv4gLB6J0d","time":"2024-10-24T00:15:39.541Z"}

and the router

"_context":"{c/501Pzb5Sh|@/bqxn}\u003cTerminator\u003e","file":"github.com/openziti/ziti/router/xgress/xgress.go:575","func":"github.com/openziti/ziti/router/xgress.(*Xgress).flushSendThenClose.func1","level":"info","msg":"sending end of circuit payload","time":"2024-10-24T11:00:18.537Z"}
{"_context":"{c/501Pzb5Sh|@/4b8x}\u003cInitiator\u003e","circuitId":"501Pzb5Sh","file":"github.com/openziti/ziti/router/xgress/xgress.go:491","func":"github.com/openziti/ziti/router/xgress.(*Xgress).tx.func3","level":"info","msg":"payload sent [0 B]","origin":1,"seq":1,"time":"2024-10-24T11:00:18.538Z"}
{"_context":"{c/501Pzb5Sh|@/4b8x}\u003cInitiator\u003e","file":"github.com/openziti/ziti/router/xgress/xgress.go:575","func":"github.com/openziti/ziti/router/xgress.(*Xgress).flushSendThenClose.func1","level":"info","msg":"sending end of circuit payload","time":"2024-10-24T11:00:18.538Z"}
{"_context":"{c/501Pzb5Sh|@/4b8x}\u003cInitiator\u003e","circuitId":"501Pzb5Sh","error":"cannot forward payload, no destination for circuit=501Pzb5Sh src=4b8x dst=bqxn","file":"github.com/openziti/ziti/router/handler_xgress/receive.go:37","func":"github.com/openziti/ziti/router/handler_xgress.(*receiveHandler).HandleXgressReceive","level":"error","msg":"unable to forward payload","origin":0,"seq":3,"time":"2024-10-24T11:00:18.538Z"}
{"circuitCount":1,"ctrlId":"ziti-controller-ctrl-plane-identity","file":"github.com/openziti/ziti/router/forwarder/faulter.go:107","func":"github.com/openziti/ziti/router/forwarder.(*Faulter).run","level":"warning","msg":"reported forwarding faults","time":"2024-10-24T11:00:25.979Z"}

ziti fabric list circuits lists a number of circuits to the service. I suspect that the issue lies somehow with the amount of parallel connections that the browser opens. I am not sure how to proceed debugging this and would appreciate any pointers.

Thanks!

Hi @robbje, welcome to the community and OpenZiti (and zrok/BrowZer)!

That's -- unexpected. I don't believe I've ever seen a report like this before. We use OpenZiti in front of our own Grafana instance and I've never noticed a problem like this.

I'll have to point some other people at this post and see if they have any thoughts on what might be best to do next.

My first guess would be that you are running out of available ports for some reason, especially since it normalizes after some time, exactly as you surmise. I could see the connections timing out and releasing the socket to be reused, which would return the system to functioning. We've seen this behavior in very large networks before, but there could be a limitation in K8s I'm not aware of that would mimic it. When you say there are a "number" of circuits how many zeroes could be on that number?

I see about 10 circuits. I can't imagine that's the reason. The cluster is not being used for anything else, all nodes idle and retrieving one website (and dependencies) runs into this issue already. I used grafana as an example, but other websites/apps exhibit the same behavior. The hello toy app serves just one resource to the browser, hence it's (probably) not impacted. I also feel like errors on syscalls should show up in the logs, but I've seen nothing of the sort.

I've played around with it a bit more, but have gotten nowhere. Here are some more log lines, maybe something strikes the eye:

What I did was run ziti-edge-tunnel and direct firefox to http://grafana.ziti/login
Note that some requests finish fine with 200, but they are not shown in the screenshot.

About to run tunnel service... ziti-edge-tunnel
(791466)[        0.000]    INFO ziti-sdk:utils.c:198 ziti_log_set_level() set log level: root=3/INFO
(791466)[        0.000]    INFO ziti-sdk:utils.c:169 ziti_log_init() Ziti C SDK version 1.1.2 @ge35caa6(HEAD) starting at (2024-10-27T20:27:36.754)
(791466)[        0.000]    INFO ziti-edge-tunnel:instance-config.c:85 load_tunnel_status_from_file() Loading config file from /var/lib/ziti/config.json
(791466)[        0.004]    WARN ziti-sdk:model_support.c:202 model_parse() json parse error: expected comment
(791466)[       66.614]   ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:59874 err=-13, terminating connection
(791466)[       66.614]   ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:59862 err=-13, terminating connection
(791466)[       66.614]   ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:59856 err=-13, terminating connection
(791466)[       70.305]   ERROR ziti-sdk:channel.c:649 latency_timeout() ch[0] no read/write traffic on channel since before latency probe was sent, closing channel
(791466)[       70.305]   ERROR ziti-sdk:connect.c:1054 connect_reply_cb() conn[0.8/dmiXuAh2/Connecting] failed to connect [-20/not a directory]
(791466)[       70.305]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
(791466)[       70.305]   ERROR ziti-sdk:connect.c:1054 connect_reply_cb() conn[0.7/pfjOQMRR/Connecting] failed to connect [-20/not a directory]
(791466)[       70.305]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
(791466)[       70.305]   ERROR ziti-sdk:connect.c:1054 connect_reply_cb() conn[0.6/9bA8tUcF/Connecting] failed to connect [-20/not a directory]
(791466)[       70.305]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
(791466)[       70.305]   ERROR ziti-sdk:connect.c:1054 connect_reply_cb() conn[0.5/pqo9nUq6/Connecting] failed to connect [-20/not a directory]
(791466)[       70.305]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
(791466)[       70.305]   ERROR ziti-sdk:connect.c:1054 connect_reply_cb() conn[0.4/pnur-faV/Connecting] failed to connect [-20/not a directory]
(791466)[       70.305]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
(791466)[       70.305]   ERROR ziti-sdk:connect.c:1054 connect_reply_cb() conn[0.3/vQgsCZBi/Connecting] failed to connect [-20/not a directory]
(791466)[       70.305]   ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
(791466)[       70.305]    WARN ziti-sdk:connect.c:1375 queue_edge_message() conn[0.5/pqo9nUq6/Closed] disconnecting from state[7]
(791466)[       70.305]    WARN ziti-sdk:connect.c:1375 queue_edge_message() conn[0.4/pnur-faV/Closed] disconnecting from state[7]
(791466)[       70.305]    WARN ziti-sdk:connect.c:1375 queue_edge_message() conn[0.3/vQgsCZBi/Closed] disconnecting from state[7]
(791466)[      127.174]   ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:33786 err=-13, terminating connection
(791466)[      127.174]   ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:33790 err=-13, terminating connection
(791466)[      127.174]   ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:33798 err=-13, terminating connection
(791466)[      127.174]   ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:33814 err=-13, terminating connection
(791466)[      127.174]   ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:33820 err=-13, terminating connection
(791466)[      127.174]   ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:33834 err=-13, terminating connection
(791466)[      127.174]   ERROR tunnel-sdk:tunnel_tcp.c:190 on_tcp_client_err() client=tcp:100.64.0.1:33838 err=-13, terminating connection

Have you looked at you config file to see if there are any issues? I noted this in the log, which I haven't seen before that I recall.

(791466)[        0.004]    WARN ziti-sdk:model_support.c:202 model_parse() json parse error: expected comment

It seems relevant if you look at the error code (-13)

Yeah, I tried removing the config file and re-ran ziti-edge-tunnel:

ziti-edge-tunnel run --identity-dir /opt/openziti/etc/identities
About to run tunnel service... ziti-edge-tunnel
(911053)[        0.000]    INFO ziti-sdk:utils.c:198 ziti_log_set_level() set log level: root=3/INFO
(911053)[        0.000]    INFO ziti-sdk:utils.c:169 ziti_log_init() Ziti C SDK version 1.1.2 @ge35caa6(HEAD) starting at (2024-10-28T11:42:42.975)
(911053)[        0.000]    INFO ziti-edge-tunnel:instance-config.c:85 load_tunnel_status_from_file() Loading config file from /var/lib/ziti/config.json
(911053)[        0.000]   ERROR ziti-edge-tunnel:instance-config.c:60 load_config_from_file() The config file No such file or directory cannot be opened due to /var/lib/ziti/config.json. This is normal if this is a new install or if the config file was removed manually
(911053)[        0.004]    WARN ziti-sdk:model_support.c:202 model_parse() json parse error: expected comment
(911053)[        0.004]   ERROR ziti-edge-tunnel:instance-config.c:135 save_tunnel_status_to_file() Could not copy config file [/var/lib/ziti/config.json] to backup config file, the config might not exists at the moment

Is there a way to get more details as to what this warning is about? I did not consider the WARN to be important, because the tunnel works in most cases.

Anyway, I believe these error codes come from the OS, so it should be -EPERM, which is puzzling, because at first the tunnel works without issues. I don't have any SELINUX/apparmor restrictions.

I'm looking for ways to debug this myself, as I doubt that this can be easily reproduced on someone else's machines. Can I, for instance, enable verbose logging for the ziti-controller and ziti-router running in k8s via the configmap? I don't think I saw a possibilty in the docs other than providing a command line argument, which would require me to use different container images.

You can change the logging level via IPC in run time.
If you set it up to debug, remember to turn it back down after you collect the process time span or you may blow your volume out when you're not looking.

sudo ziti agent set-log-level $(pidof ziti-process) [debug/verbose/info]

Unfortunately, the increased log level did not yield anything interesting. I assume that firefox will open a socket to the server and send a SYN request, which will prompt ziti-edge-tunnel to contact the controller to establish a circuit. Once that has happened the SYN request will be forwarded to the circuit and end up with the target service. If this is correct - and observing ss -atpn, I see multiple connections waiting for circuits to be established, like this:

SYN-SENT   0      1          100.64.0.1:56344      100.64.0.3:80    users:(("firefox",pid=2208,fd=301))

SYN was sent and it's queued, waiting for the circuit to establish. This does seem to work, but after the third request (or so) it stops working, even for other services.
Is there a way to trace circuit creation specifically in the ziti controller and router?
Are there any mechanisms that come to mind (maybe some kind of flow control) that prevent circuits from being created based on some dynamic conditions?

So -13 apparently is delivered by lwip, the userland tcp stack. I had someone else test the connection on their machine and it works flawlessly for them. So it's my local setup that ziti-edge-tunnel refuses to work on properly. Curiously the same is also true from within a virtual machine, like a fresh ubuntu machine. Not sure where to go from here, but I don't think it's a ziti problem at this point. I tried a different machine in my place, connected via cellular network instead of my home internet connection and get... the same problem. Not sure what's left to try.