HA not working as expected with ext-jwt signer (without JIT enrollment)

I have a 2.0.0-pre12 setup that is configured to use a local Keycloak instance as an external JWT signer without JIT enrollment of external identities. This means I'm pre-creating the identities for users in Ziti with their external ID configured to use our Keycloak, so they can simply log in from any endpoint with an Edge Client installed and have their services available from wherever they are.

Now this setup works just fine with a standalone controller. But I'm currently trying out controller clustering, and this setup does not seem to be compatible with HA:

Clustering and Raft work fine, everything is set up, and when using an identity enrolled with a JWT, all the controller failover scenarios work just fine. The Edge client picks up the new leader, reauthenticates, and all my sessions to any Ziti services in the network work just fine, even when stopping and starting the tunnel on the client side.

Now when I'm using a pre-created identity as mentioned above that is simply configured with Keycloak as the ext-jwt-signer, this breaks. When the initial controller goes down, so that there are just two cluster members remaining and the leader changes, the currently authenticated session keeps working just fine.

But as soon as I stop and restart the tunnel on the client side and reauth with Keycloak (this also works and I see my services in the Edge client as green) and want to connect to a service again, the router that is terminating the service is throwing invalid client certificate errors:

{"error":"invalid client certificate for api session","file":"github.com/openziti/ziti/v2/router/xgress_edge/accept.go:288","func":"github.com/openziti/ziti/v2/router/xgress_edge.(*Acceptor).handleUngroupedUnderlay","level":"error","msg":"failure accepting edge channel u{classic}-\u003ei{ziti-sdk-c[0]@Mac.lan/P8Zb} with underlay","time":"2026-05-06T09:54:20.601Z"}

It seems that this setup, which works just fine with a single controller, breaks under HA because it has no client certificate enrollment: the routers appear to depend on a client certificate that is somehow not there with this "simple" ext-jwt-signer setup.

As mentioned, when I enroll an identity with an OTT JWT from Ziti itself, the failover works just fine, but for users who need to be able to jump between machines with the same Keycloak account, it is not really an option to re-enroll their identity every time they use a different computer.

We recently detected a bug where token-based authentication from some SDKs would errantly include a client certificate.

Do you happen to know offhand if you used to have a certificate-based enrollment for the identity in use?

The identities in question were only ever created the following way:

ziti edge create identity "user@example.com" --external-id "user@example.com" -a "my_attribute"

No JWT was created for those. And as mentioned, they work just fine in standalone mode and also in HA mode as long as the first cluster member stays up (it doesn't even need to be the leader, just reachable by the Edge client and the router).

But as soon as the first controller node is down and I restart the tunnel in the Edge client and re-authenticate with Keycloak, the router throws the "invalid client certificate" messages.

When I create an identity with ziti edge create identity failover-test -a "my_attribute" -o ./test.jwt and use the JWT to enroll instead of Keycloak auth, all the HA stuff works just fine, even after restarting the tunnel. The router is not complaining about invalid client certs in this case.

@andrew.martinez with the bugfix, do you mean this commit here: fixes #3846 OIDC tokens from non-cert auth no longer bind incidental … · openziti/ziti@767ff11 · GitHub

Is there a build with this already available I could use to test? Maybe this fixes the issue for us.

Okay, found it, got the ziti binary from the corresponding build here: fixes #3846 OIDC tokens from non-cert auth no longer bind incidental … · openziti/ziti@767ff11 · GitHub

However, this does not fix the issue. Same behavior, on failover the ext-jwt-authenticated session produces the same invalid client certificate for api session error on the router as before.

Ok. I'll see if I can create a synthetic reproduction.

I found a smoking gun that partially explains what you are seeing in HA. I haven't connected all the dots as to why. I believe it might be due to a misconfiguration of the verification pools in ERs that is related to the 2.0 certificate proof-of-possession enhancements - from a different angle and not accounted for in my other bugs.

I logged this issue, working on a fix for 2.0: Router rejects API session certificate after HA failover for ext-jwt identities without enrollment · Issue #3857 · openziti/ziti · GitHub

Thanks a lot for your effort!

I just tried the new build again, but unfortunately I still see the same behavior. I also re-enrolled the router just to be sure, but it didn't make a difference.

One thing I noticed is that the router CA file that gets created during enrollment only contains the CA cert and the edge cert of one controller, but not those of the other controllers in the cluster. However, even after adding the other controllers' edge certs to it, the router still throws the same invalid client certificate errors. And I'm not sure this should even make a difference, as JWT-enrolled identities work just fine.
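For anyone who wants to check what such a CA file actually contains, here is a minimal Go sketch that lists every certificate in a PEM bundle. It only uses the standard library; the function name describeBundle and the command-line usage are hypothetical and not part of any Ziti tooling.

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

// describeBundle returns one summary line per certificate found in a PEM bundle.
// (Hypothetical helper for illustration, not part of the ziti CLI.)
func describeBundle(data []byte) ([]string, error) {
	var out []string
	for {
		var block *pem.Block
		block, data = pem.Decode(data)
		if block == nil {
			return out, nil // no more PEM blocks
		}
		if block.Type != "CERTIFICATE" {
			continue // skip anything that is not a certificate
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return out, err
		}
		out = append(out, fmt.Sprintf("subject=%q issuer=%q isCA=%t",
			cert.Subject.String(), cert.Issuer.String(), cert.IsCA))
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: bundle <path-to-ca-bundle.pem>")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	lines, err := describeBundle(data)
	if err != nil {
		fmt.Println("parse error:", err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

Running it against the router's CA file shows at a glance which entries are CAs and which are leaf certificates.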

The only thing that matters is the root CA for the entire PKI. Intermediates should be supplied with the API Session Cert bundle to complete chain verification.
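To illustrate the idea with plain Go crypto/x509 (a minimal sketch, not the router's actual verification code; issue, verifyChain, and demo are hypothetical names, and the throwaway PKI is generated in memory): the verifier trusts only the root, and the leaf verifies as long as the peer presents the intermediate along with it.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// issue creates a certificate for name, signed by parent (self-signed when parent is nil).
func issue(name string, isCA bool, parent *x509.Certificate, parentKey *ecdsa.PrivateKey, serial int64) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: name},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		BasicConstraintsValid: true,
		IsCA:                  isCA,
		KeyUsage:              x509.KeyUsageDigitalSignature,
	}
	if isCA {
		tmpl.KeyUsage |= x509.KeyUsageCertSign
	} else {
		tmpl.ExtKeyUsage = []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth}
	}
	if parent == nil {
		parent, parentKey = tmpl, key // self-signed root
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// verifyChain checks leaf against a pool that trusts only the root; the
// intermediates are whatever the peer presented alongside the leaf.
func verifyChain(leaf, root *x509.Certificate, presented []*x509.Certificate) error {
	roots := x509.NewCertPool()
	roots.AddCert(root)
	inter := x509.NewCertPool()
	for _, c := range presented {
		inter.AddCert(c)
	}
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:         roots,
		Intermediates: inter,
		KeyUsages:     []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err
}

// demo builds root -> intermediate -> leaf and verifies the leaf twice:
// once with the intermediate presented, once without it.
func demo() (withInter, withoutInter, setup error) {
	root, rootKey, err := issue("demo root CA", true, nil, nil, 1)
	if err != nil {
		return nil, nil, err
	}
	inter, interKey, err := issue("demo intermediate CA", true, root, rootKey, 2)
	if err != nil {
		return nil, nil, err
	}
	leaf, _, err := issue("demo api session cert", false, inter, interKey, 3)
	if err != nil {
		return nil, nil, err
	}
	withInter = verifyChain(leaf, root, []*x509.Certificate{inter})
	withoutInter = verifyChain(leaf, root, nil)
	return withInter, withoutInter, nil
}

func main() {
	with, without, err := demo()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("intermediate presented:", with)
	fmt.Println("intermediate missing:  ", without)
}
```

In this sketch the first verification succeeds and the second fails, which is why adding individual controller certs to the router's CA file should not be needed when the intermediates arrive with the certificate bundle.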

What you did should not be necessary.

If you do not mind, can you supply the tunneler configuration file you are using and the tunneler log from a focused run? Additionally, please send the CA bundle file. None of that should contain private keys, and please do not send me any. I believe you can send them to me through a private message.

If you could do the same for your controllers, that would also be helpful, but please also include the intermediate cert they are using as their signing cert.

Again, no private keys.

I will try to see if I can recreate any other issues in a lab environment, but I did not see anything additional the last time. That leaves configuration issues as a possibility.

I did just find a related bug in sdk-golang. However, if you are running a tunneler, it would use the c-sdk, and the c-sdk doesn't have the same bug.

I see the issue on both Ziti Desktop Edge for Windows and macOS, so that would rule out the sdk-golang bug.

I will do another test run and let you know, with the logs etc., once done.

@andrew.martinez I did another run and sent you the files via DM.