OpenZiti v2.0.0 HA Cluster - Clarification Questions

Hi all,

We're currently running a single-node OpenZiti v2.0.0-pre2 setup (Docker Compose) and planning to expand to a 3-controller HA Raft cluster. I'd like to verify my understanding of how this works before we proceed.

### Our current setup

- 1 controller + 1 router on Docker Compose

- Edge clients connect via `openziti.northwind.io` (OIDC auth)

- Management/ZAC on port 8441

### Planned HA setup

```
        DNS Round-Robin or Load Balancer
          openziti.northwind.io
                  │
      ┌───────────┼───────────┐
      ▼           ▼           ▼
 Controller1  Controller2  Controller3
 (Leader)     (Follower)   (Follower)
      │           │           │
      └────Raft Consensus─────┘
                  │
      ┌───────────┼───────────┐
      ▼           ▼           ▼
   Router1     Router2     Router3 ...
```

### Questions

**1. DNS Round-Robin - is it actually supported?**

My understanding is that Ziti SDKs have built-in failover. The flow would be:

1. SDK resolves `openziti.northwind.io` → gets 3 IPs

2. Tries first IP - if timeout/refused, tries next

3. Once connected, the SDK fetches the controller list from the API

4. From then on, the SDK uses the internal controller list for failover (independent of DNS)

Is this correct? Or is an L4 load balancer (e.g. Azure LB with TCP passthrough) required/recommended?
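In case it helps pin down what I mean, here's a rough sketch of steps 1–2 in Python. This is purely illustrative, not actual SDK code; `resolve_all` and `first_reachable` are names I made up for this example:

```python
import socket
from typing import Callable, Iterable, Optional

def resolve_all(hostname: str, port: int) -> list[str]:
    """Step 1: DNS round-robin hands back every controller IP for the name."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

def first_reachable(ips: Iterable[str], port: int,
                    connector: Optional[Callable[[str, int], object]] = None):
    """Step 2: try each IP in turn; return the first successful connection."""
    connector = connector or (lambda ip, p: socket.create_connection((ip, p), timeout=5))
    last_error: Optional[Exception] = None
    for ip in ips:
        try:
            return connector(ip, port)   # first reachable controller wins
        except OSError as err:
            last_error = err             # timeout/refused: move on to the next IP
    raise ConnectionError("no controller reachable") from last_error
```

Steps 3–4 (fetching the controller list from the API and using it for later failover) would then replace the DNS-derived list entirely.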

**2. Ports between controllers for Raft**

Does Raft communication between controllers use the same `ctrl.listener` port (443 in our case)? My understanding is that Raft traffic is tunneled over the existing TLS channels between controllers, so there is **no separate Raft port** to open.

Meaning the firewall rules would simply be:

```
# Between all 3 controllers (bidirectional)
ctrl1 ↔ ctrl2: TCP 443
ctrl1 ↔ ctrl3: TCP 443
ctrl2 ↔ ctrl3: TCP 443
```

Is this correct, or are there additional ports needed?

**3. Edge API address with multiple controllers**

In the controller config, the `edge.api.address` is what clients use. In a 3-controller setup, should all controllers advertise the **shared** LB/DNS hostname?

```yaml
# Same on all 3 controllers
edge:
  api:
    address: openziti.northwind.io:443
```

While `ctrl.options.advertiseAddress` is unique per node:

```yaml
# Unique per controller
ctrl:
  options:
    advertiseAddress: tls:ctrl1.northwind.io:443  # ctrl2, ctrl3 respectively
```

**4. OIDC with 3 controllers**

Since OIDC signing keys are replicated via Raft, a token issued by ctrl1 should be valid against ctrl2 and ctrl3. The OIDC issuer would be set to the shared hostname (`https://openziti.northwind.io`). Is this how it works in practice?

**5. Router configuration for HA**

Routers can be configured with multiple controller endpoints for failover:

```yaml
ctrl:
  endpoints:
    - tls:ctrl1.northwind.io:443
    - tls:ctrl2.northwind.io:443
    - tls:ctrl3.northwind.io:443
```

Or should routers also just point to the shared DNS/LB address?

**6. Bootstrap sequence**

My understanding of the initial cluster bootstrap:

1. Start ctrl1 first - it bootstraps as single-node Raft

2. On ctrl1: `ziti agent cluster add tls:ctrl2.northwind.io:443`

3. Start ctrl2 - it joins the cluster

4. On ctrl1: `ziti agent cluster add tls:ctrl3.northwind.io:443`

5. Start ctrl3 - it joins the cluster

Is this the correct sequence?

Thanks in advance for any clarification!

Hi @msbusk,

Before diving into the individual questions: as far as I know, using a load balancer in front of the controllers is somewhat problematic. You might be able to make it work, but I don't think it's recommended. However, most of my work has been on the back-end of the clustering, so if I'm wrong, I'll have someone post a correction :slight_smile:


1. DNS Round-Robin / Load Balancer

The Ziti HA system was designed around direct controller discovery rather than load balancers. Here's how it works:

The identity file, post-enrollment, will generally contain all the endpoints in the cluster already. When the SDKs connect to a controller, they will also check the `/edge/client/v1/controllers` endpoint, which returns the individual API URLs (edge-client and edge-oidc bindings) for every controller in the cluster. The SDK can then fail over to any of those discovered URLs directly.
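As a concrete illustration of that discovery step, the SDK essentially does something like the following. Note the JSON here is hypothetical -- the field names are illustrative only, so check your controller version for the real response schema:

```python
import json

# Hypothetical response from /edge/client/v1/controllers; real field names may differ.
sample = json.loads("""
{
  "data": [
    {"apiAddresses": {"edge-client": ["https://ctrl1.northwind.io:443/edge/client/v1"],
                      "edge-oidc":   ["https://ctrl1.northwind.io:443/oidc"]}},
    {"apiAddresses": {"edge-client": ["https://ctrl2.northwind.io:443/edge/client/v1"],
                      "edge-oidc":   ["https://ctrl2.northwind.io:443/oidc"]}},
    {"apiAddresses": {"edge-client": ["https://ctrl3.northwind.io:443/edge/client/v1"],
                      "edge-oidc":   ["https://ctrl3.northwind.io:443/oidc"]}}
  ]
}
""")

def client_api_urls(response: dict) -> list[str]:
    """Collect every edge-client URL the cluster advertises."""
    return [url
            for ctrl in response["data"]
            for url in ctrl["apiAddresses"].get("edge-client", [])]
```

The SDK keeps that list and fails over to any entry in it, independent of DNS.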

Routers are also kept up to date with cluster membership information.

Because of this built-in discovery, putting a load balancer in front creates an awkward situation:

  • If controllers advertise their own individual addresses, SDKs will discover those and connect directly, bypassing the LB after the first connection anyway.
  • If controllers all advertise the LB address, the /controllers endpoint returns N entries all pointing at the same hostname, which defeats the purpose of the discovery mechanism.
  • On the cert side, with L4 passthrough the client sees whichever controller's cert the LB routes to. Each controller's cert needs to contain the LB hostname in its SANs, and the controller validates this at startup.

The recommended approach is to give each controller its own DNS name. You might be able to get the load balancer approach to work, but it's not how it's intended to work.

2. Raft Communication Ports

You are correct: there's no separate port necessary. Raft traffic rides over the existing control channel connections on the `ctrl.listener` port, so your firewall rules above are sufficient.

3. Edge API Address Configuration

This is the value used to build the controller list shared with SDKs, so assuming you don't go with the load balancer approach, each controller's `edge.api.address` should be unique.

The `ctrl.options.advertiseAddress` is used by other controllers and routers, and it also needs to be unique per node.
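So, per controller, something like this sketch, assuming each controller gets its own DNS name as recommended above (hostnames borrowed from your example):

```yaml
# ctrl1 -- ctrl2/ctrl3 are analogous, each with its own hostname
edge:
  api:
    address: ctrl1.northwind.io:443
ctrl:
  options:
    advertiseAddress: tls:ctrl1.northwind.io:443
```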

4. OIDC Token Validity Across Controllers

During the raft mesh peer handshake, controllers exchange their signing certificates. Each controller stores the signing certs of all its peers. This means any controller can verify tokens issued by any other controller.

Important detail: the OIDC authentication flow (the redirect/callback dance) must complete entirely with a single controller -- auth requests are not synchronized across nodes.
However, once tokens are issued, they can be validated at any controller. Token revocations are synchronized across the cluster via Raft.

In practice this means: if you're using DNS round-robin for OIDC, the auth flow redirect/callback should land on the same controller. With individual controller addresses this isn't a concern, since the SDK targets a specific controller for auth.

5. Router Configuration

Routers can be initialized with zero, one, or multiple controller endpoints. When they connect, they will receive the current cluster set and persist it to a file. The reason I say zero is that router enrollments now generally contain the cluster endpoints as well, and the enroller will use those to initialize the router config.

6. Bootstrap Sequence

Yes, that bootstrap sequence sounds right to me.

Cheers,
Paul

Hi @msbusk

I checked in with my teammates, and @qrkourier advised that although using a load balancer isn't necessary in most configurations, it should work if you use TLS passthrough with TLS servername (SNI) routing.

Cheers,
Paul