Hopefully that was (mostly) with 1.10.1 and earlier(?) One of the changes in 1.10.2 was to remove a log message that had an oversized contribution to the logging volume. I think you've also enabled tlsuv logging, which is extremely verbose and definitely not intended for typical use.
Yeah, that's on me - tlsuv logging makes up 99.9% of the log volume.
Hi Shawn (and everyone else :)),
Update on this: I dug a bit deeper into the ziti-sdk-c source code to understand what's actually going on, since this is currently a major dealbreaker for us. It turns out the controller was never the problem - it's the tunneler's OIDC token refresh logic in the C SDK.
What happens is: when a network disruption occurs (even just a few seconds), the internal HTTP handle used for OIDC token refresh gets into a broken state. The SDK retries the refresh on the same broken handle, which keeps failing. With only a 30-second margin before token expiry (and retries every 5 seconds, so only ~6 attempts), the token expires before a refresh can succeed. Once expired, the tunneler enters the UNAUTHORIZED / "not authenticated" state we've been seeing, tears down all channels, and never recovers.
Three specific bugs in ziti-sdk-c (a sketch of the combined fix follows this list):
- Broken HTTP handle never gets reset: after a connection failure, refresh_cb() retries on the same broken tlsuv_http_t handle indefinitely. The fix (tlsuv_http_cancel_all()) already exists in the codebase for controller shutdown, but was never used for OIDC error recovery.
- No connect timeout on the OIDC client: the controller HTTP client has a 15s connect timeout, but the OIDC client had none, inheriting the kernel default (for me ~130s). A single timed-out attempt could consume the entire refresh window.
- Fixed 30-second refresh margin: expires_in - 30 only gives ~6 retry attempts. Changed to expires_in / 2, which for a typical 1800s token gives 900s (~120 attempts).
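For illustration, the combined change has roughly this shape. This is a minimal sketch, not the actual patch: struct oidc_clt and the helper names are stand-ins for the SDK's internals, and tlsuv_http_connect_timeout() is assumed here to be the same setter the controller client uses for its 15s timeout (see the PR below for the real diff).

```c
#include <stdint.h>
#include <tlsuv/http.h>
#include <uv.h>

#define RETRY_DELAY_MS 5000

struct oidc_clt {             // stand-in for the SDK's OIDC client state
    tlsuv_http_t http;        // HTTP handle used for token refresh
    uv_timer_t   refresh_timer;
};

extern void refresh_cb(uv_timer_t *t); // the SDK's existing retry callback

// Fix 1: after a failed refresh, cancel everything queued on the handle so
// the next attempt opens a fresh connection instead of reusing a dead one.
static void on_refresh_error(struct oidc_clt *clt) {
    tlsuv_http_cancel_all(&clt->http);
    uv_timer_start(&clt->refresh_timer, refresh_cb, RETRY_DELAY_MS, 0);
}

// Fix 2: give the OIDC client an explicit 15s connect timeout, matching the
// controller client, instead of inheriting the kernel default (~130s here).
static void init_oidc_http(struct oidc_clt *clt, uv_loop_t *loop, const char *url) {
    tlsuv_http_init(loop, &clt->http, url);
    tlsuv_http_connect_timeout(&clt->http, 15000);
}

// Fix 3: refresh halfway through the token lifetime instead of a fixed 30s
// before expiry (an 1800s token yields a 900s retry window).
static uint64_t refresh_delay_ms(uint64_t expires_in_secs) {
    return expires_in_secs / 2 * 1000;
}
```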
I compiled a patched build with these fixes and deployed it alongside unpatched v1.10.9 tunnelers. The patched tunneler successfully recovered from network disruptions where the v1.10.9 tunnelers entered the permanent "not authenticated" state as usual.
I've opened a PR and issue for this:
- Issue: OIDC token refresh never recovers from network disruption, permanent UNAUTHORIZED state requires restart · Issue #990 · openziti/ziti-sdk-c · GitHub
- PR: fix: OIDC token refresh recovery after network disruption by dmuensterer · Pull Request #991 · openziti/ziti-sdk-c · GitHub
This also explains the pattern I described earlier where multiple tunnelers drop sequentially - they all hit the same transient network blip, and none of them can recover.
It would be awesome if someone on the team could review it and tell me whether this makes sense to you as well.
Dominic, thank you for your contributions.
I added a couple of comments with improvement ideas on your PR.
Also, add yourself to the CONTRIBUTORS file if you so wish.
Thank you Eugene! I've implemented your suggestions, let me know what you think!
Unfortunately, I've found out in the meantime that the original cause of the issue wasn't resolved by my code changes, even though I still think they make sense:
After deploying the patched tunnelers, they survived ~10 hours of stable operation, but then both failed at the exact same time with a different pattern.
Last successful cycle at 09:54:53 -- OIDC refresh, API calls, cert renewal all working:
Feb 11 09:54:53 ziti-edge-tunnel[1936334]: token refresh success
Feb 11 09:54:53 ziti-edge-tunnel[1936334]: successfully generated extended cert
Feb 11 09:54:53 ziti-edge-tunnel[1936334]: successfully verified extended cert. good until 2027-02-11 08:54
First failure 10 seconds later at 09:55:03 -- TCP connects, TLS completes, HTTP immediately aborted:
Feb 11 09:55:03 ziti-edge-tunnel[1936334]: tr_connect_cb status = 0 <- TCP OK
Feb 11 09:55:03 ziti-edge-tunnel[1936334]: handshake completed <- TLS OK
Feb 11 09:55:03 ziti-edge-tunnel[1936334]: processing IO: events=5 status=0
Feb 11 09:55:03 ziti-edge-tunnel[1936334]: openssl read[1]: A000126/SSL routines
Feb 11 09:55:03 ziti-edge-tunnel[1936334]: tls read error: unexpected eof while reading
Feb 11 09:55:03 ziti-edge-tunnel[1936334]: error[-103/software caused connection abort]
This pattern repeated for 47 minutes until I manually killed the process. Every attempt: TCP succeeds, TLS completes, HTTP aborted with events=5. At 10:24 the edge router started rejecting with token is expired, confirming the token was never refreshed.
I can confirm the controller was not restarted -- same Ziti-Instance-Id throughout:
Feb 10 23:40:30 [PID 1936334]: Ziti-Instance-Id: cmlgmesep00004q7ogpa3246l <- startup
Feb 11 09:54:53 [PID 1936334]: Ziti-Instance-Id: cmlgmesep00004q7ogpa3246l <- last success
Feb 11 10:42:00 [PID 1946637]: Ziti-Instance-Id: cmlgmesep00004q7ogpa3246l <- new ZTE process, same ID
New ZTE process worked immediately after the old one was killed:
Feb 11 10:42:00 ziti-edge-tunnel[1946637]: tr_connect_cb status = 0
Feb 11 10:42:00 ziti-edge-tunnel[1946637]: handshake completed
Feb 11 10:42:00 ziti-edge-tunnel[1946637]: 200 OK <- OIDC auth in 0.241s
Both patched tunnelers on different hosts failed at the same time, so it's not a client-side state issue. The controller wasn't restarted and a fresh process works immediately. Something on the controller side started rejecting established client connections at the application layer while still accepting new ones.
The events=5 (UV_READABLE|UV_DISCONNECT) pattern is consistent with the server sending a TCP FIN right after the TLS handshake, before any HTTP response. This looks like post-handshake client certificate rejection in TLS 1.3.
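For reference, a minimal standalone program to decode that bitmask against libuv's uv_poll_event values (compile with -luv):

```c
#include <stdio.h>
#include <uv.h>

int main(void) {
    int events = 5; // from the log line "processing IO: events=5"
    printf("events=%d:%s%s%s%s\n", events,
           (events & UV_READABLE)    ? " UV_READABLE"    : "",  // 1
           (events & UV_WRITABLE)    ? " UV_WRITABLE"    : "",  // 2
           (events & UV_DISCONNECT)  ? " UV_DISCONNECT"  : "",  // 4
           (events & UV_PRIORITIZED) ? " UV_PRIORITIZED" : ""); // 8
    return 0; // prints: events=5: UV_READABLE UV_DISCONNECT
}
```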
Controller logs for that window were already rotated so I couldn't check what triggered it. I've increased log retention temporarily and will look into the controller code to understand what could cause this.
Dominic, I am working on some changes/fixes in tlsuv to address the bad state of the tlsuv_http_t client.
Would you like to pull my branch (http-connect-debug) and test with it?
Sure. Anything specific I should do here? Just deploy and log?
Nothing special, just pull the branch into your build and test your use cases
I deployed your branch http-connect-debug about 12 hours ago. Let's see!
Also, I've now addressed your suggestions on the PR. I'll also deploy this build to another 2 tunnelers.
I have yet to find out what causes the EOF during the TLS connections - whether it's a network/IO error from my controller VPS or something on the application side.
EOF usually comes from the application and indicates a normal TCP shutdown handshake. I believe our controller may close the connection without sending a response in some use cases.
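Incidentally, the A000126 code in your earlier log decodes to OpenSSL's SSL_R_UNEXPECTED_EOF_WHILE_READING (reason 0x126 = 294), i.e. the peer closed the TCP connection without sending a TLS close_notify first. A minimal sketch of how an OpenSSL 3.x caller can tell the two apart (classify_read_result is an illustrative helper, not tlsuv code):

```c
#include <openssl/ssl.h>
#include <openssl/err.h>

// Returns 0 for a clean TLS shutdown, -1 for an abrupt close (TCP FIN with
// no close_notify -- the pattern in the tunneler log), -2 for anything else.
static int classify_read_result(SSL *ssl, int rc) {
    int err = SSL_get_error(ssl, rc);
    if (err == SSL_ERROR_ZERO_RETURN)
        return 0;   // peer sent close_notify: normal TLS shutdown
    if (err == SSL_ERROR_SSL &&
        ERR_GET_REASON(ERR_peek_error()) == SSL_R_UNEXPECTED_EOF_WHILE_READING)
        return -1;  // "unexpected eof while reading"
    return -2;
}
```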
Is this expected behavior or do you suspect a bug somewhere?
Hi Eugene,
my tunneler with your test branch just failed
From what I can see, at some point the controller accepts TCP, completes TLS, then immediately closes the connection without sending any HTTP response.
There were 156 failed refresh attempts in the 13-minute capture window. I can confirm the controller was reachable the whole time, since I've set up monitoring that curls the controller every second.
One complete failure cycle:
edge-tunnel-debug-branch.log (78.5 KB)
Any ideas?
On the controller side I am seeing lots and lots of the following error messages. Any ideas if they could be related?
Feb 13 00:44:46 zt ziti[1530746]: {"_context":"tls:0.0.0.0:8441","error":"cryptobyte: pending child length 65908 exceeds 2-byte length prefix","file":"github.com/openziti/transport/v2@v2.0.193/tls/listener.go:260","func":"github.com/openziti/transport/v2/tls.(*sharedListener).processConn","level":"error","msg":"handshake failed","remote":"IPADDRESS:53150","time":"2026-02-13T00:44:46.060Z"}
Please don't mind me thinking out loud here.
I am trying to think a bit outside the box to find the cause of this, since we seem to be the only ones with this issue. That must mean we do something differently from most other users.
I can't think of any edge case except that the controller is quite old (2022) and I've upgraded it to new versions regularly. Maybe there's something in the certificates that lost compatibility along the way?
I am quite certain the breaking point for us is OIDC auth. With v1.5.4 tunnelers, which don't yet support the edge-oidc endpoint, we don't have this issue.
Also, our ZDE on Mac always uses legacy auth even though the controller should support it: Mac Ziti Desktop Edge randomly loosing authentication - #2 by scareything
hm...
zt.yaml:
```yaml
v: 3
db: "/home/ziti/.ziti/quickstart/zt/db/ctrl.db"
identity:
  cert: "/home/ziti/.ziti/quickstart/zt/pki/zt-intermediate/certs/zt-client.2023-17-10.cert"
  server_cert: "/home/ziti/.ziti/quickstart/zt/pki/zt-intermediate/certs/zt-server.2023-10-16-2734.chain.pem"
  key: "/home/ziti/.ziti/quickstart/zt/pki/zt-intermediate/keys/zt-server.key"
  ca: "/home/ziti/.ziti/quickstart/zt/pki/cas.pem"
  alt_server_certs:
    - server_cert: "/home/ziti/.ziti/quickstart/zt/le-cert-renewal/certs/live/le.zt.company.com/fullchain.pem"
      server_key: "/home/ziti/.ziti/quickstart/zt/le-cert-renewal/certs/live/le.zt.company.com/privkey.pem"
trustDomain: "company.com"
ctrl:
  listener: tls:0.0.0.0:8440
mgmt:
  listener: tls:0.0.0.0:10000
healthChecks:
  boltCheck:
    interval: 30s
    timeout: 20s
    initialDelay: 30s
edge:
  api:
    sessionTimeout: 30m
    address: zt.company.com:8441
  enrollment:
    signingCert:
      cert: /home/ziti/.ziti/quickstart/zt/pki/zt-signing-intermediate/certs/zt-signing-intermediate.cert
      key: /home/ziti/.ziti/quickstart/zt/pki/zt-signing-intermediate/keys/zt-signing-intermediate.key
    edgeIdentity:
      duration: 180m
    edgeRouter:
      duration: 180m
web:
  - name: client-management
    bindPoints:
      - interface: 0.0.0.0:8441
        address: zt.company.com:8441
    identity:
      ca: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/certs/zt.company.com-intermediate.cert"
      key: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/keys/zt.company.com-server.key"
      server_cert: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/certs/zt.company.com-server.2023-10-16-2734.chain.pem"
      cert: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/certs/zt.company.com-client.2023-10-17-3103.cert"
      alt_server_certs:
        - server_cert: "/home/ziti/.ziti/quickstart/zt/le-cert-renewal/certs/live/le.zt.company.com/fullchain.pem"
          server_key: "/home/ziti/.ziti/quickstart/zt/le-cert-renewal/certs/live/le.zt.company.com/privkey.pem"
    options:
      readTimeout: 5000ms
      writeTimeout: 100000ms
      minTLSVersion: TLS1.2
      maxTLSVersion: TLS1.3
    apis:
      - binding: edge-client
        options: { }
      - binding: edge-oidc
        options: { }
      - binding: edge-management
        options: { }
  - name: management
    bindPoints:
      - interface: 127.0.0.1:18441
        address: zt.company.com:18441
    identity:
      ca: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/certs/zt.company.com-intermediate.cert"
      key: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/keys/zt.company.com-server.key"
      server_cert: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/certs/zt.company.com-server.2023-10-16-2734.chain.pem"
      cert: "/home/ziti/.ziti/quickstart/zt/pki/zt-intermediate/certs/zt-client.2023-17-10.cert"
    options:
      readTimeout: 5000ms
      writeTimeout: 100000ms
      minTLSVersion: TLS1.3
      maxTLSVersion: TLS1.3
    apis:
      - binding: edge-management
        options: { }
      - binding: zac
        options:
          location: /home/ziti/.ziti/quickstart/zt/ziti-console
          indexFile: index.html
      - binding: fabric
        options: { }
metrics:
  reportInterval: 20s
events:
  edgeSessionLogger:
    subscriptions:
      - type: edge.sessions
        include:
          - created
    handler:
      type: file
      format: json
      maxsizemb: 50
      maxbackups: 2
      path: /var/log/ziti/edge-sessions.log
```
How big is your controller's CA bundle, and how many external CAs do you have enabled?
That error happens server side when the TLS server (controller/router) tries to send back all the DNs it supports, and TLS only allows 65k worth of those - which is a TON. The only way that could be blowing out is if you have a huge number of certs in your CA bundle (e.g. something appending over and over) or a huge number of external CAs enabled.
Most DNs fall into the 80-200 byte range, meaning the list can hold roughly 324-799 DNs (arithmetic in the sketch below).
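For reference: the whole certificate_authorities list is capped by a 2-byte length prefix, and each DN entry carries its own 2-byte length, which is where the 324-799 range comes from.

```c
#include <stdio.h>

int main(void) {
    const int list_cap = 0xFFFF; // 2-byte length prefix caps the whole DN list
    const int per_dn   = 2;      // each DN entry has its own 2-byte length
    printf("200-byte DNs: %d max\n", list_cap / (200 + per_dn)); // 324
    printf(" 80-byte DNs: %d max\n", list_cap / (80 + per_dn));  // 799
    return 0;
}
```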
That would happen on all requests, right? I've seen it happen sporadically on the same controller.
Maybe it has something to do with throttling logic? Don't we throttle pre-TLS handshake?
It could be different on the same controller depending on xweb bind points and whether those bind points have custom CA bundles, but it would happen on each request to the bind point. For routers, multiple bind points aren't allowed.
I haven't considered the throttling angle. I don't see why throttling would set so many DNs there unless it is a fragmented/mangled packet that goes out.
The throttling does close TLS handshakes. However, it wouldn't result in the error:
"cryptobyte: pending child length 65908 exceeds 2-byte length prefix"
Still, I did find a bug in the throttling logic, so that could be related. @dmuensterer do you have TLS handshake throttling enabled? It was disabled by default until 1.8.0-pre5, where it was enabled by default, which is when I found the throttling bug.
Thank you,
Paul
I thought about this as well. However, I don't have many CAs - cas.pem contains only 4:
# for f in \
pki/zt-intermediate/certs/zt-server.2023-10-16-2734.chain.pem \
pki/zt.company.com-intermediate/certs/zt.company.com-server.2023-10-16-2734.chain.pem \
le-cert-renewal/certs/live/le.zt.company.com/fullchain.pem \
pki/zt-signing-intermediate/certs/zt-signing-intermediate.cert \
pki/zt.company.com-intermediate/certs/zt.company.com-intermediate.cert \
pki/zt-intermediate/certs/zt-client.2023-17-10.cert \
pki/cas.pem
do
echo "=== $f ==="
wc -c "$f"
grep -c "BEGIN CERTIFICATE" "$f"
echo
done
=== pki/zt-intermediate/certs/zt-server.2023-10-16-2734.chain.pem ===
4184 pki/zt-intermediate/certs/zt-server.2023-10-16-2734.chain.pem
2
=== pki/zt.company.com-intermediate/certs/zt.company.com-server.2023-10-16-2734.chain.pem ===
4265 pki/zt.company.com-intermediate/certs/zt.company.com-server.2023-10-16-2734.chain.pem
2
=== le-cert-renewal/certs/live/le.zt.company.com/fullchain.pem ===
2880 le-cert-renewal/certs/live/le.zt.company.com/fullchain.pem
2
=== pki/zt-signing-intermediate/certs/zt-signing-intermediate.cert ===
2094 pki/zt-signing-intermediate/certs/zt-signing-intermediate.cert
1
=== pki/zt.company.com-intermediate/certs/zt.company.com-intermediate.cert ===
2086 pki/zt.company.com-intermediate/certs/zt.company.com-intermediate.cert
1
=== pki/zt-intermediate/certs/zt-client.2023-17-10.cert ===
2053 pki/zt-intermediate/certs/zt-client.2023-17-10.cert
1
=== pki/cas.pem ===
8282 pki/cas.pem
4
I don't have any external CAs configured:
ziti edge list cas
╭────┬──────┬───────┬───────┬─────────────┬───────────────────╮
│ ID │ NAME │ FLAGS │ TOKEN │ FINGERPRINT │   CONFIGURATION   │
├────┼──────┼───────┼───────┼─────────────┼─────┬──────┬──────┤
╰────┴──────┴───────┴───────┴─────────────┴─────┴──────┴──────╯
results: none
Flags: (V) Verified, (A) AutoCa Enrollment, (O) OttCA Enrollment, (E) Authentication Enabled
Dominic,
I took the liberty of updating your PR/branch to pull in our latest changes (including tlsuv). Please let us know how the testing goes.
We are ready to merge the PR. (There is a small issue of missing strndup on Windows.)