Hopefully that was (mostly) with 1.10.1 and earlier(?) One of the changes in 1.10.2 was to remove a log message that had an oversized contribution to the logging volume. I think you've also enabled tlsuv logging, which is extremely verbose and definitely not intended for typical use.
Yeah, that's on me - tlsuv logging makes up 99.9% of the log volume.
Hi Shawn (and everyone else :)),
Update on this: I dug a bit deeper into the ziti-sdk-c source code to understand what's actually going on, since this is currently a major dealbreaker for us. It turns out the controller was never the problem - it's the tunneler's OIDC token refresh logic in the C SDK.
What happens is: when a network disruption occurs (even just a few seconds), the internal HTTP handle used for OIDC token refresh gets into a broken state. The SDK retries the refresh on the same broken handle, which keeps failing. With only a 30-second margin before token expiry (and retries every 5 seconds, so only ~6 attempts), the token expires before a refresh can succeed. Once expired, the tunneler enters the UNAUTHORIZED / "not authenticated" state we've been seeing, tears down all channels, and never recovers.
Three specific bugs in ziti-sdk-c (a sketch of the combined fix follows this list):
- Broken HTTP handle never gets reset: after a connection failure, refresh_cb() retries on the same broken tlsuv_http_t handle indefinitely. The fix (tlsuv_http_cancel_all()) already exists in the codebase for controller shutdown, but was never used for OIDC error recovery.
- No connect timeout on the OIDC client: the controller HTTP client has a 15s connect timeout, but the OIDC client had none, inheriting the kernel default (for me ~130s). A single timed-out attempt could consume the entire refresh window.
- Fixed 30-second refresh margin: expires_in - 30 only gives ~6 retry attempts. Changed to expires_in / 2, which for a typical 1800s token gives 900s (~120 attempts).
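For illustration, the combined change has roughly this shape. This is a minimal sketch, not the actual patch: struct oidc_clt and the helper names are stand-ins for the SDK's internals, and tlsuv_http_connect_timeout() is assumed here to be the same setter the controller client uses for its 15s timeout (see the PR below for the real diff).

```c
#include <stdint.h>
#include <tlsuv/http.h>
#include <uv.h>

#define RETRY_DELAY_MS 5000

struct oidc_clt {             // stand-in for the SDK's OIDC client state
    tlsuv_http_t http;        // HTTP handle used for token refresh
    uv_timer_t   refresh_timer;
};

extern void refresh_cb(uv_timer_t *t); // the SDK's existing retry callback

// Fix 1: after a failed refresh, cancel everything queued on the handle so
// the next attempt opens a fresh connection instead of reusing a dead one.
static void on_refresh_error(struct oidc_clt *clt) {
    tlsuv_http_cancel_all(&clt->http);
    uv_timer_start(&clt->refresh_timer, refresh_cb, RETRY_DELAY_MS, 0);
}

// Fix 2: give the OIDC client an explicit 15s connect timeout, matching the
// controller client, instead of inheriting the kernel default (~130s here).
static void init_oidc_http(struct oidc_clt *clt, uv_loop_t *loop, const char *url) {
    tlsuv_http_init(loop, &clt->http, url);
    tlsuv_http_connect_timeout(&clt->http, 15000);
}

// Fix 3: refresh halfway through the token lifetime instead of a fixed 30s
// before expiry (an 1800s token yields a 900s retry window).
static uint64_t refresh_delay_ms(uint64_t expires_in_secs) {
    return expires_in_secs / 2 * 1000;
}
```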
I compiled a patched build with these fixes and deployed it alongside unpatched v1.10.9 tunnelers. The patched tunneler successfully recovered from network disruptions where the v1.10.9 tunnelers entered the permanent "not authenticated" state as usual.
I've opened a PR and issue for this:
- Issue: OIDC token refresh never recovers from network disruption, permanent UNAUTHORIZED state requires restart · Issue #990 · openziti/ziti-sdk-c · GitHub
- PR: fix: OIDC token refresh recovery after network disruption by dmuensterer · Pull Request #991 · openziti/ziti-sdk-c · GitHub
This also explains the pattern I described earlier where multiple tunnelers drop sequentially - they all hit the same transient network blip, and none of them can recover.
It would be awesome if someone on the team could review it and tell me whether this makes sense to you as well.
Dominic, thank you for your contributions.
I added a couple of comments with improvement ideas on your PR.
Also, add yourself to the CONTRIBUTORS file if you so wish.
Thank you Eugene! I've implemented your suggestions, let me know what you think!
Unfortunately, I've found out in the meantime that the original cause of the issue wasn't resolved by my code changes, even though I still think they make sense:
After deploying the patched tunnelers, they survived ~10 hours of stable operation, but then both failed at the exact same time with a different pattern.
Last successful cycle at 09:54:53 -- OIDC refresh, API calls, cert renewal all working:
Feb 11 09:54:53 ziti-edge-tunnel[1936334]: token refresh success
Feb 11 09:54:53 ziti-edge-tunnel[1936334]: successfully generated extended cert
Feb 11 09:54:53 ziti-edge-tunnel[1936334]: successfully verified extended cert. good until 2027-02-11 08:54
First failure 10 seconds later at 09:55:03 -- TCP connects, TLS completes, HTTP immediately aborted:
Feb 11 09:55:03 ziti-edge-tunnel[1936334]: tr_connect_cb status = 0 <- TCP OK
Feb 11 09:55:03 ziti-edge-tunnel[1936334]: handshake completed <- TLS OK
Feb 11 09:55:03 ziti-edge-tunnel[1936334]: processing IO: events=5 status=0
Feb 11 09:55:03 ziti-edge-tunnel[1936334]: openssl read[1]: A000126/SSL routines
Feb 11 09:55:03 ziti-edge-tunnel[1936334]: tls read error: unexpected eof while reading
Feb 11 09:55:03 ziti-edge-tunnel[1936334]: error[-103/software caused connection abort]
This pattern repeated for 47 minutes until I manually killed the process. Every attempt: TCP succeeds, TLS completes, HTTP aborted with events=5. At 10:24 the edge router started rejecting with token is expired, confirming the token was never refreshed.
I can confirm the controller was not restarted -- same Ziti-Instance-Id throughout:
Feb 10 23:40:30 [PID 1936334]: Ziti-Instance-Id: cmlgmesep00004q7ogpa3246l <- startup
Feb 11 09:54:53 [PID 1936334]: Ziti-Instance-Id: cmlgmesep00004q7ogpa3246l <- last success
Feb 11 10:42:00 [PID 1946637]: Ziti-Instance-Id: cmlgmesep00004q7ogpa3246l <- new ZTE process, same ID
New ZTE process worked immediately after the old one was killed:
Feb 11 10:42:00 ziti-edge-tunnel[1946637]: tr_connect_cb status = 0
Feb 11 10:42:00 ziti-edge-tunnel[1946637]: handshake completed
Feb 11 10:42:00 ziti-edge-tunnel[1946637]: 200 OK <- OIDC auth in 0.241s
Both patched tunnelers on different hosts failed at the same time, so it's not a client-side state issue. The controller wasn't restarted and a fresh process works immediately. Something on the controller side started rejecting established client connections at the application layer while still accepting new ones.
The events=5 (UV_READABLE|UV_DISCONNECT) pattern is consistent with the server sending a TCP FIN right after the TLS handshake, before any HTTP response. This looks like post-handshake client certificate rejection in TLS 1.3.
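For reference, a minimal standalone program to decode that bitmask against libuv's uv_poll_event values (compile with -luv):

```c
#include <stdio.h>
#include <uv.h>

int main(void) {
    int events = 5; // from the log line "processing IO: events=5"
    printf("events=%d:%s%s%s%s\n", events,
           (events & UV_READABLE)    ? " UV_READABLE"    : "",  // 1
           (events & UV_WRITABLE)    ? " UV_WRITABLE"    : "",  // 2
           (events & UV_DISCONNECT)  ? " UV_DISCONNECT"  : "",  // 4
           (events & UV_PRIORITIZED) ? " UV_PRIORITIZED" : ""); // 8
    return 0; // prints: events=5: UV_READABLE UV_DISCONNECT
}
```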
Controller logs for that window were already rotated so I couldn't check what triggered it. I've increased log retention temporarily and will look into the controller code to understand what could cause this.
Dominic, I am working on some changes/fixes in tlsuv to address the bad state of the tlsuv_http_t client.
Would you like to pull my branch (http-connect-debug) and test with it?
Sure. Anything specific I should do here? Just deploy and log?
Nothing special, just pull the branch into your build and test your use cases
I deployed your branch http-connect-debug about 12 hours ago. Let's see!
Also, I've now addressed your suggestions on the PR. I'll also deploy this build to another 2 tunnelers.
I have yet to find out what causes the EOF during the TLS connections - whether it's a network/IO error from my controller VPS or something on the application side.
EOF usually comes from the application and indicates a normal TCP shutdown handshake. I believe our controller may close the connection without sending a response in some use cases.
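Incidentally, the A000126 code in your earlier log decodes to OpenSSL's SSL_R_UNEXPECTED_EOF_WHILE_READING (reason 0x126 = 294), i.e. the peer closed the TCP connection without sending a TLS close_notify first. A minimal sketch of how an OpenSSL 3.x caller can tell the two apart (classify_read_result is an illustrative helper, not tlsuv code):

```c
#include <openssl/ssl.h>
#include <openssl/err.h>

// Returns 0 for a clean TLS shutdown, -1 for an abrupt close (TCP FIN with
// no close_notify -- the pattern in the tunneler log), -2 for anything else.
static int classify_read_result(SSL *ssl, int rc) {
    int err = SSL_get_error(ssl, rc);
    if (err == SSL_ERROR_ZERO_RETURN)
        return 0;   // peer sent close_notify: normal TLS shutdown
    if (err == SSL_ERROR_SSL &&
        ERR_GET_REASON(ERR_peek_error()) == SSL_R_UNEXPECTED_EOF_WHILE_READING)
        return -1;  // "unexpected eof while reading"
    return -2;
}
```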
Is this expected behavior or do you suspect a bug somewhere?
Hi Eugene,
my tunneler with your test branch just failed
From what I can see, at some point the controller accepts TCP, completes TLS, then immediately closes the connection without sending any HTTP response.
There were 156 failed refresh attempts in the 13-minute capture window. I can confirm the controller was reachable the whole time, since I've set up monitoring that curls the controller every second.
One complete failure cycle:
edge-tunnel-debug-branch.log (78.5 KB)
Any ideas?
On the controller side I am seeing lots and lots of the following error messages. Any ideas if they could be related?
Feb 13 00:44:46 zt ziti[1530746]: {"_context":"tls:0.0.0.0:8441","error":"cryptobyte: pending child length 65908 exceeds 2-byte length prefix","file":"github.com/openziti/transport/v2@v2.0.193/tls/listener.go:260","func":"github.com/openziti/transport/v2/tls.(*sharedListener).processConn","level":"error","msg":"handshake failed","remote":"IPADDRESS:53150","time":"2026-02-13T00:44:46.060Z"}
Please don't mind me thinking out loud here.
I am trying to think a bit outside the box to find the cause of this, since we seem to be the only ones with this issue. That must mean we do something differently from most other users.
I can't think of any edge case except that the controller is quite old (2022) and I've upgraded it to new versions regularly. Maybe there's something in the certificates that lost compatibility along the way?
I am quite certain the breaking point for us is OIDC auth. With v1.5.4 tunnelers, which don't yet support the edge-oidc endpoint, we don't have this issue.
Also, our ZDE on Mac always uses legacy auth even though the controller should support it: Mac Ziti Desktop Edge randomly loosing authentication - #2 by scareything
hm...
zt.yaml:
```yaml
v: 3
db: "/home/ziti/.ziti/quickstart/zt/db/ctrl.db"
identity:
  cert: "/home/ziti/.ziti/quickstart/zt/pki/zt-intermediate/certs/zt-client.2023-17-10.cert"
  server_cert: "/home/ziti/.ziti/quickstart/zt/pki/zt-intermediate/certs/zt-server.2023-10-16-2734.chain.pem"
  key: "/home/ziti/.ziti/quickstart/zt/pki/zt-intermediate/keys/zt-server.key"
  ca: "/home/ziti/.ziti/quickstart/zt/pki/cas.pem"
  alt_server_certs:
    - server_cert: "/home/ziti/.ziti/quickstart/zt/le-cert-renewal/certs/live/le.zt.company.com/fullchain.pem"
      server_key: "/home/ziti/.ziti/quickstart/zt/le-cert-renewal/certs/live/le.zt.company.com/privkey.pem"
trustDomain: "company.com"
ctrl:
  listener: tls:0.0.0.0:8440
mgmt:
  listener: tls:0.0.0.0:10000
healthChecks:
  boltCheck:
    interval: 30s
    timeout: 20s
    initialDelay: 30s
edge:
  api:
    sessionTimeout: 30m
    address: zt.company.com:8441
  enrollment:
    signingCert:
      cert: /home/ziti/.ziti/quickstart/zt/pki/zt-signing-intermediate/certs/zt-signing-intermediate.cert
      key: /home/ziti/.ziti/quickstart/zt/pki/zt-signing-intermediate/keys/zt-signing-intermediate.key
    edgeIdentity:
      duration: 180m
    edgeRouter:
      duration: 180m
web:
  - name: client-management
    bindPoints:
      - interface: 0.0.0.0:8441
        address: zt.company.com:8441
    identity:
      ca: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/certs/zt.company.com-intermediate.cert"
      key: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/keys/zt.company.com-server.key"
      server_cert: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/certs/zt.company.com-server.2023-10-16-2734.chain.pem"
      cert: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/certs/zt.company.com-client.2023-10-17-3103.cert"
      alt_server_certs:
        - server_cert: "/home/ziti/.ziti/quickstart/zt/le-cert-renewal/certs/live/le.zt.company.com/fullchain.pem"
          server_key: "/home/ziti/.ziti/quickstart/zt/le-cert-renewal/certs/live/le.zt.company.com/privkey.pem"
    options:
      readTimeout: 5000ms
      writeTimeout: 100000ms
      minTLSVersion: TLS1.2
      maxTLSVersion: TLS1.3
    apis:
      - binding: edge-client
        options: { }
      - binding: edge-oidc
        options: { }
      - binding: edge-management
        options: { }
  - name: management
    bindPoints:
      - interface: 127.0.0.1:18441
        address: zt.company.com:18441
    identity:
      ca: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/certs/zt.company.com-intermediate.cert"
      key: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/keys/zt.company.com-server.key"
      server_cert: "/home/ziti/.ziti/quickstart/zt/pki/zt.company.com-intermediate/certs/zt.company.com-server.2023-10-16-2734.chain.pem"
      cert: "/home/ziti/.ziti/quickstart/zt/pki/zt-intermediate/certs/zt-client.2023-17-10.cert"
    options:
      readTimeout: 5000ms
      writeTimeout: 100000ms
      minTLSVersion: TLS1.3
      maxTLSVersion: TLS1.3
    apis:
      - binding: edge-management
        options: { }
      - binding: zac
        options:
          location: /home/ziti/.ziti/quickstart/zt/ziti-console
          indexFile: index.html
      - binding: fabric
        options: { }
metrics:
  reportInterval: 20s
events:
  edgeSessionLogger:
    subscriptions:
      - type: edge.sessions
        include:
          - created
    handler:
      type: file
      format: json
      maxsizemb: 50
      maxbackups: 2
      path: /var/log/ziti/edge-sessions.log
```
How big is your controller's CA bundle, and how many external CAs do you have enabled?
That error happens server side when the TLS server (controller/router) tries to send back all the DNs it supports, and TLS only allows 65k worth of those - which is a TON. The only way that could be blowing out is if you have a huge number of certs in your CA bundle (e.g. something appending over and over) or a huge number of external CAs enabled.
Most DNs fall into the 80-200 byte range, meaning the list can hold roughly 324-799 DNs (arithmetic in the sketch below).
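For reference: the whole certificate_authorities list is capped by a 2-byte length prefix, and each DN entry carries its own 2-byte length, which is where the 324-799 range comes from.

```c
#include <stdio.h>

int main(void) {
    const int list_cap = 0xFFFF; // 2-byte length prefix caps the whole DN list
    const int per_dn   = 2;      // each DN entry has its own 2-byte length
    printf("200-byte DNs: %d max\n", list_cap / (200 + per_dn)); // 324
    printf(" 80-byte DNs: %d max\n", list_cap / (80 + per_dn));  // 799
    return 0;
}
```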
That would happen on all requests, right? I've seen it happen sporadically on the same controller.
Maybe it has something to do with throttling logic? Don't we throttle pre-TLS handshake?
It could be different on the same controller depending on xweb bind points and whether those bind points have custom CA bundles, but it would happen on each request to the bind point. For routers, multiple bind points aren't allowed.
I haven't considered the throttling angle. I don't see why throttling would set so many DNs there unless it is a fragmented/mangled packet that goes out.
The throttling does close TLS handshakes. However, it wouldn't result in the error:
"cryptobyte: pending child length 65908 exceeds 2-byte length prefix"
Still, I did find a bug in the throttling logic, so that could be related. @dmuensterer do you have TLS handshake throttling enabled? It was disabled by default until 1.8.0-pre5, where it was enabled by default, which is when I found the throttling bug.
Thank you,
Paul
I thought about this as well. However, I don't have many CAs - cas.pem contains only 4:
# for f in \
pki/zt-intermediate/certs/zt-server.2023-10-16-2734.chain.pem \
pki/zt.company.com-intermediate/certs/zt.company.com-server.2023-10-16-2734.chain.pem \
le-cert-renewal/certs/live/le.zt.company.com/fullchain.pem \
pki/zt-signing-intermediate/certs/zt-signing-intermediate.cert \
pki/zt.company.com-intermediate/certs/zt.company.com-intermediate.cert \
pki/zt-intermediate/certs/zt-client.2023-17-10.cert \
pki/cas.pem
do
echo "=== $f ==="
wc -c "$f"
grep -c "BEGIN CERTIFICATE" "$f"
echo
done
=== pki/zt-intermediate/certs/zt-server.2023-10-16-2734.chain.pem ===
4184 pki/zt-intermediate/certs/zt-server.2023-10-16-2734.chain.pem
2
=== pki/zt.company.com-intermediate/certs/zt.company.com-server.2023-10-16-2734.chain.pem ===
4265 pki/zt.company.com-intermediate/certs/zt.company.com-server.2023-10-16-2734.chain.pem
2
=== le-cert-renewal/certs/live/le.zt.company.com/fullchain.pem ===
2880 le-cert-renewal/certs/live/le.zt.company.com/fullchain.pem
2
=== pki/zt-signing-intermediate/certs/zt-signing-intermediate.cert ===
2094 pki/zt-signing-intermediate/certs/zt-signing-intermediate.cert
1
=== pki/zt.company.com-intermediate/certs/zt.company.com-intermediate.cert ===
2086 pki/zt.company.com-intermediate/certs/zt.company.com-intermediate.cert
1
=== pki/zt-intermediate/certs/zt-client.2023-17-10.cert ===
2053 pki/zt-intermediate/certs/zt-client.2023-17-10.cert
1
=== pki/cas.pem ===
8282 pki/cas.pem
4
I don't have any external CAs configured:
ziti edge list cas
╭────┬──────┬───────┬───────┬─────────────┬───────────────────╮
│ ID │ NAME │ FLAGS │ TOKEN │ FINGERPRINT │   CONFIGURATION   │
├────┼──────┼───────┼───────┼─────────────┼─────┬──────┬──────┤
╰────┴──────┴───────┴───────┴─────────────┴─────┴──────┴──────╯
results: none
Flags: (V) Verified, (A) AutoCa Enrollment, (O) OttCA Enrollment, (E) Authentication Enabled
Dominic,
I took the liberty of updating your PR/branch to pull in our latest changes (including tlsuv). Please let us know how the testing goes.
We are ready to merge the PR. (There is a small issue of missing strndup on Windows.)