HA not working as expected with ext-jwt signer (without JIT enrollment)

I have a 2.0.0-pre12 setup that is configured to use a local Keycloak instance as an external JWT signer without JIT enrollment of external identities. This means I'm pre-creating the identities for users in Ziti with their external ID configured to use our Keycloak, so they can simply log in from any endpoint with an Edge Client installed and have their services available from wherever they are.

Now this setup works just fine with a standalone controller. But I'm currently trying out controller clustering, and this setup does not seem to be compatible with HA:

Clustering and Raft work fine, everything is set up, and when using an identity enrolled with a JWT, all the controller failover scenarios work just fine. The Edge client picks up the new leader, reauthenticates, and all my sessions to any Ziti services in the network keep working, even when stopping and starting the tunnel on the client side.

Now when I'm using a precreated identity as mentioned above that is simply configured with Keycloak as the ext-jwt-signer, this breaks. When the initial controller goes down, so that there are just two cluster members remaining and the leader changes, the currently authenticated session works just fine.

But as soon as I stop and restart the tunnel on the client side and reauth with Keycloak (this also works and I see my services in the Edge client as green) and want to connect to a service again, the router that is terminating the service is throwing invalid client certificate errors:

{"error":"invalid client certificate for api session","file":"github.com/openziti/ziti/v2/router/xgress_edge/accept.go:288","func":"github.com/openziti/ziti/v2/router/xgress_edge.(*Acceptor).handleUngroupedUnderlay","level":"error","msg":"failure accepting edge channel u{classic}-\u003ei{ziti-sdk-c[0]@Mac.lan/P8Zb} with underlay","time":"2026-05-06T09:54:20.601Z"}

It seems that this setup, which works just fine with a single controller, breaks under HA because it lacks the client certificate enrollment: the routers depend on a client certificate that is somehow not there with this "simple" ext-jwt-signer setup.

As mentioned, when I enroll an identity with an OTT JWT from Ziti itself, the failover works just fine, but for users that need to be able to jump between machines with their same Keycloak account, it is not really an option to re-enroll their identity every time they use a different computer.

We recently detected a bug where token based authentication from some SDKs would errantly include a client certificate.

Do you happen to know offhand whether you used to have a certificate-based enrollment for the identity in use?

The identities in question were only ever created the following way:

ziti edge create identity "user@example.com" --external-id "user@example.com" -a "my_attribute"

No JWT created for those. And as mentioned, they work just fine in standalone mode and also in HA mode as long as the first cluster member stays up (doesn't even need to be the leader, just reachable by the Edge client and the router).

But as soon as the first controller node is down and I restart the tunnel in the Edge client and re-authenticate with Keycloak, the router throws the "invalid client certificate" messages.

When I create an identity with ziti edge create identity failover-test -a "my_attribute" -o ./test.jwt and use the JWT to enroll instead of Keycloak auth, all the HA stuff works just fine, even after restarting the tunnel. The router is not complaining about invalid client certs in this case.

@andrew.martinez with the bugfix, do you mean this commit here: fixes #3846 OIDC tokens from non-cert auth no longer bind incidental … · openziti/ziti@767ff11 · GitHub

Is there a build with this already available I could use to test? Maybe this fixes the issue for us.

Okay, found it, got the ziti binary from the corresponding build here: fixes #3846 OIDC tokens from non-cert auth no longer bind incidental … · openziti/ziti@767ff11 · GitHub

However, this does not fix the issue. Same behavior, on failover the ext-jwt-authenticated session produces the same invalid client certificate for api session error on the router as before.

Ok. I'll see if I can create a synthetic reproduction.

I found a smoking gun that partially explains what you are seeing in HA. I haven't connected all the dots as to why. I believe it might be due to a misconfiguration of the verification pools in ERs that is related to the 2.0 certificate proof-of-possession enhancements - from a different angle and not accounted for in my other bugs.

I logged this issue, working on a fix for 2.0: Router rejects API session certificate after HA failover for ext-jwt identities without enrollment · Issue #3857 · openziti/ziti · GitHub

Thanks a lot for your effort!

I just tried the new build again, but unfortunately I still see the same behavior. I also re-enrolled the router just to be sure, but it didn't make a difference.

One thing I noticed is that the configured router CA file that gets created during enrollment only contains the CA cert and then the edge cert of one controller, but not the others in the cluster. But even after adding the other controllers' edge certs in there, it's still throwing the same invalid client certificate errors. And I'm not sure if this should even make a difference, as JWT-enrolled identities work just fine.

The only thing that matters is the root CA for the entire PKI. Intermediates should be supplied with the API Session Cert bundle to complete chain verification.
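To make that statement concrete, here is a minimal Go sketch using only the standard library. It is not router code: the helper names (`issue`, `verifySession`, `verifyAcrossSigners`) and the certificate names are made up for illustration, and the PKI is a throwaway one generated in memory. It shows that with only the shared root trusted, a session cert from either controller's signer verifies as long as that signer is supplied as an intermediate, and fails when it is not.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

var serial int64 // counter so every throwaway certificate gets a unique serial

// issue creates a throwaway certificate. A nil parent yields a self-signed cert.
func issue(cn string, isCA bool, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	serial++
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  isCA,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageDigitalSignature,
	}
	if isCA {
		tmpl.KeyUsage |= x509.KeyUsageCertSign
	}
	if parent == nil {
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert, key
}

// verifySession trusts only roots; presented plays the role of the
// intermediates that arrive together with the session cert.
func verifySession(leaf *x509.Certificate, roots, presented []*x509.Certificate) error {
	opts := x509.VerifyOptions{
		Roots:         x509.NewCertPool(),
		Intermediates: x509.NewCertPool(),
		KeyUsages:     []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	}
	for _, c := range roots {
		opts.Roots.AddCert(c)
	}
	for _, c := range presented {
		opts.Intermediates.AddCert(c)
	}
	_, err := leaf.Verify(opts)
	return err
}

// verifyAcrossSigners builds one root with two per-controller signers and
// verifies a session cert from each, plus one without its intermediate.
func verifyAcrossSigners() (fromCtrl1, fromCtrl2, noBundle error) {
	root, rootKey := issue("edge-root", true, nil, nil)
	signer1, key1 := issue("edge-signer-1", true, root, rootKey)
	signer2, key2 := issue("edge-signer-2", true, root, rootKey)
	session1, _ := issue("session-ctrl1", false, signer1, key1)
	session2, _ := issue("session-ctrl2", false, signer2, key2)

	roots := []*x509.Certificate{root}
	return verifySession(session1, roots, []*x509.Certificate{signer1}),
		verifySession(session2, roots, []*x509.Certificate{signer2}),
		verifySession(session2, roots, nil)
}

func main() {
	a, b, c := verifyAcrossSigners()
	fmt.Println("signer 1, intermediate supplied:", a)
	fmt.Println("signer 2, intermediate supplied:", b)
	fmt.Println("signer 2, no intermediate:", c)
}
```

The point of the design is that the trust pool never needs to know about individual controllers; only the presented bundle changes from session to session.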

What you did should not be necessary.

If you do not mind, can you supply the tunneler configuration file you are using, the tunneler log from a focused run, and the CA bundle file? None of that should contain private keys, and please do not send me any. I believe you can send them to me through a private message.

If you could do the same for your controllers, that would also be helpful, but please also include the intermediate cert they are using as their signing cert.

Again, no private keys.

I will try to see if I can recreate any other issues in a lab environment, but I did not see anything additional the last time. That leaves configuration issues as a possibility.

I did just find a related bug in the sdk-golang. However, if you are running a tunneler, that would use the c-sdk, and the c-sdk doesn't have the same bug.

I see the issue on both Ziti Desktop Edge for Windows and macOS, so that would rule out the sdk-golang bug.

I will do another test run and let you know with the logs, etc... once done.

@andrew.martinez I did another run and sent you the files via DM.

@andrew.martinez did you have a chance to look into it already? I retried the same again with the latest 2.0.0-pre14 builds and still experience the same behavior.

Taking a look today and writing a reproduction following your log to see if I can recreate it.

Thanks a lot! Let me know if you need anything else to support.

I have now tried to reproduce your issue in a few different ways.

I took the log you provided and used it as a template for the exact same steps in a controlled environment with a 3-controller cluster, with external JWT signers, and identities configured and set up as you reported. This was purely in the ziti repo's Go-based integration tests. I killed both random controllers and the specific controller the test was connected to.

No issue.

I then took the same setup, cleaned, and used the C-SDK version your logs were using. I did the same thing, observing the client when controller failover occurred.

No issue.

Looking at the assets you sent me, I noticed that your intermediate signer was named generically (e.g. my-edge-signer), and I then re-ran the tests in both environments to see if using the same signer across all 3 controllers would cause the issue.

No issue.

At this point I have no leads on what the issue is.

I'm also at a loss here :frowning:

I'm using the ziti-controller Helm chart for standing up the HA controllers in three separate Kubernetes K3s single-node clusters, but I don't think this should make a difference (but might explain the intermediate signer, as this is how the Helm chart creates them when doing the cluster-join steps described in the docs: Install the Controller in Kubernetes | NetFoundry Documentation).

With this topic really nagging at me, I spent some more time trying to debug this yesterday, and after not really getting any results, I decided to feed this problem into Claude Code and got something:

router/state/cert_origin.go::getControllerRootPool built the root pool by calling certs[0].Verify(opts) against the router's CA bundle and appending chain[len(chain)-1] as the "root".

When the bundle contains both the edge-signer intermediate and the self-signed edge-root, Go's x509.Verify ends a chain at the first certificate it finds in opts.Roots, so the returned chain terminates at the edge-signer (treated as a trusted root for that verify call, since it is in opts.Roots). So chain[len(chain)-1] is the intermediate, not the actual root. In an HA cluster with N per-controller edge-signers chaining to a shared edge-root, only the enrollment controller's signer ever makes it into the root pool; sibling signers cannot chain.
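The behavior described above can be reproduced outside the router with a small standard-library sketch. This is a toy model, not the actual cert_origin.go code: `issue`, `pool`, `deriveRoot` and `simulate` are hypothetical helpers, the PKI is generated in memory, and I am assuming the cert being verified was issued by the first controller's signer.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

var serial int64 // counter so every throwaway certificate gets a unique serial

// issue creates a throwaway certificate. A nil parent yields a self-signed cert.
func issue(cn string, isCA bool, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	serial++
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  isCA,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageDigitalSignature,
	}
	if isCA {
		tmpl.KeyUsage |= x509.KeyUsageCertSign
	}
	if parent == nil {
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert, key
}

// pool builds a CertPool from the given certificates.
func pool(certs ...*x509.Certificate) *x509.CertPool {
	p := x509.NewCertPool()
	for _, c := range certs {
		p.AddCert(c)
	}
	return p
}

// deriveRoot mimics the described logic: verify a cert with the whole CA
// bundle as Roots and treat the last element of the first chain as "the root".
func deriveRoot(cert *x509.Certificate, bundle ...*x509.Certificate) *x509.Certificate {
	chains, err := cert.Verify(x509.VerifyOptions{
		Roots:     pool(bundle...),
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	if err != nil {
		panic(err)
	}
	chain := chains[0]
	return chain[len(chain)-1]
}

// simulate returns the CommonName of the derived "root" and the error from
// verifying a session cert signed by the sibling controller's signer.
func simulate() (derivedCN string, siblingErr error) {
	root, rootKey := issue("edge-root", true, nil, nil)
	signer1, key1 := issue("edge-signer-1", true, root, rootKey)
	signer2, key2 := issue("edge-signer-2", true, root, rootKey)
	ctrl1, _ := issue("controller-1", false, signer1, key1)
	session2, _ := issue("session-from-controller-2", false, signer2, key2)

	derived := deriveRoot(ctrl1, signer1, root) // bundle holds signer and root
	_, siblingErr = session2.Verify(x509.VerifyOptions{
		Roots:         pool(derived),
		Intermediates: pool(signer2), // supplied alongside the session cert
		KeyUsages:     []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return derived.Subject.CommonName, siblingErr
}

func main() {
	cn, err := simulate()
	fmt.Println("derived \"root\":", cn)
	fmt.Println("sibling-signed session cert:", err)
}
```

In this model the derived "root" is the first signer rather than edge-root, so the session cert issued by the second signer has nothing to chain to even though its intermediate is supplied.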

OTT-enrolled identities were unaffected because their long-lived cert is signed once at enrollment and its signer is always present. ext-jwt identities request a fresh session cert via POST /current-api-session/certificates on every session, which is signed by whichever controller handles that request — often not the enrollment controller.

I have also built a local fix with Claude that, in such a case, tries to get the root cert via the router's CaPool() and adds it to rootPool, which seems to fix the problem for me.

I'd like to verify this with you and hopefully this is something that can also be changed upstream on your side.