macOS connection refused error

I’m experiencing a rather strange issue. I’m running an OpenZiti controller and router version 1.8.0-pre4 with the latest client.

On my macOS clients I get “connection refused” for some services, unless I stop the edge tunnel on the client and run `nc -vc hostname.tld port` at the exact moment the Ziti client connects.

I haven’t collected any logs yet, but I’ll make sure to do that.

I mainly wanted to check whether anyone else has run into this. I didn’t have this problem on 1.6.12; that version was completely stable.

Hi @msbusk,

I definitely would ask you to look through the logs to see if there's anything helpful in there that we might be able to look at. Connection refused errors like that are certainly strange. My guess is the logs would help show whether there are any errors.

Symptoms

  1. Multiple concurrent connection failures with the same error pattern:
    ERROR ziti-sdk:connect.c:1070 connect_reply_cb() connwebadmin.core.test.ziti failed to connect, reason=invalid session
    DEBUG ziti-sdk:connect.c:323 complete_conn_req() connwebadmin.core.test.ziti Disconnected failed: connection is closed

  2. Connection state transitions: Connecting => Disconnected immediately

  3. Timing: Multiple failed attempts within seconds (16:40:31 - 16:40:36)

  4. One successful connection (conn[4.38]) does establish and transfer data, but many others fail with invalid session

Observed Behavior

  • DNS resolution works correctly: webadmin.core.test.ziti → 100.64.0.4
  • Router connection is stable: ziti-router-01 connected successfully
  • API session refresh is working (every 15 seconds): GET[/current-api-session/service-updates] returns 200 OK
  • Service is listed as available in service updates
  • Other services show similar "invalid session" errors during the same timeframe

Sample Log Output

Failed Connections:
[2026-01-06T16:40:31.002Z] ERROR connect_reply_cb() connwebadmin.core.test.ziti failed to connect, reason=invalid session
[2026-01-06T16:40:31.104Z] ERROR connect_reply_cb() connwebadmin.core.test.ziti failed to connect, reason=invalid session
[2026-01-06T16:40:31.183Z] ERROR connect_reply_cb() connwebadmin.core.test.ziti failed to connect, reason=invalid session
[2026-01-06T16:40:31.183Z] ERROR connect_reply_cb() connwebadmin.core.test.ziti failed to connect, reason=invalid session
... (20+ more similar errors)

Successful Connection (when it works):
[2026-01-06T16:40:36.973Z] VERBOSE flush_to_client() connwebadmin.core.test.ziti client stalled: 4 bytes buffered
[2026-01-06T16:40:36.974Z] VERBOSE conn_inbound_data_msg() connwebadmin.core.test.ziti decrypted 16367 bytes

API Session Status:
[2026-01-06T16:40:44.133Z] DEBUG ctrl_body_cb() ctrl[https://controller.test.ziti:443] completed GET[/current-api-session/service-updates] in 0.165 s
[2026-01-06T16:40:44.133Z] VERBOSE ziti_services_refresh() ztx[4] scheduling service refresh 15 seconds from now

Questions

  1. What causes the "invalid session" error when the API session appears to be valid and refreshing correctly?
  2. Why do some connections succeed (4.38) while most fail with invalid session during the same timeframe?
  3. Is this related to session caching, token expiration, or service authorization timing?
  4. Could this be related to the pre-release router version (v1.8.0-pre4)?

Additional Context

  • The client was recently re-enabled after being disabled
  • Network path is stable (WiFi connection, no interruptions)
  • Same behavior affects multiple configured services
  • The successful connection (when established) transfers data normally

Any insights into what might cause these intermittent "invalid session" errors would be greatly appreciated.

(20840)[2026-01-06T16:16:32.722Z] ERROR ziti-sdk:connect.c:1070 connect_reply_cb() conn4.710/zeoXtpZ5/Connecting failed to connect, reason=invalid session
(20840)[2026-01-06T16:16:32.738Z] ERROR ziti-sdk:connect.c:1070 connect_reply_cb() conn4.711/6LPxBm47/Connecting failed to connect, reason=invalid session
(20840)[2026-01-06T16:16:33.711Z] ERROR ziti-sdk:connect.c:1070 connect_reply_cb() conn4.712/5pQP7ySq/Connecting failed to connect, reason=invalid session

Just a small update: the same error occurs in the new version v1.8.0-pre5.

It does not happen in the older versions (1.6.12). Additionally, it occurs on both Windows and macOS clients.

We have primarily seen the issue with HTTPS connections.

Thanks for testing on 1.8.0-pre5. Can you outline how exactly your overlay is set up? How many controllers and routers, are they all on 1.8.0-pre5 now, did you restart them all, etc.?

I have run 1.8.0-pre5 and I'm not seeing this particular issue, but it's not uncommon for some deviation in configuration to cause problems that we don't notice when running locally/doing development. Any extra information would help. I'll also ask around to see if there are any other pieces of information that might help figure out what's happening.

You’re welcome.

What both installations have in common is that we only started experiencing this after we moved to version 1.8.*.

We are running Azure VMs with the router and controller in Docker. We have a single controller and a single router.

We have another setup with a VM running a Controller and a Router, plus one additional router, where we are also experiencing the issue.

We have updated all components (router and controller) to version 1.8.0-pre5 and restarted everything after the update.

I forgot to mention that we are using a JWT file for enrollment, but we are experiencing the issue both with OIDC against Entra and with JWT enrollment.

Hi @msbusk , are there any router log messages related to the 'invalid session' error? That could help us figure out what flavor of invalid session you're encountering - malformed token, expired token, unrecognized signer, etc.

Thank you,
Paul

Invalid Session Errors After Server Restart

Environment

  • OpenZiti Version: 1.8.0-pre5
  • Controller Version: openziti/ziti-controller:1.8.0-pre5
  • Router Version: openziti/ziti-router:1.8.0-pre5
  • Client SDK: ziti-sdk-c
  • Client Platform: macOS
  • Deployment: Docker containers

Context

The OpenZiti server (controller + router) was restarted, which caused all clients to disconnect. After the restart, we are experiencing persistent "invalid session" errors when clients attempt to reconnect and use hosted services.

Problem Description

After server restart, clients that attempt to reconnect receive "invalid session" errors. The errors occur both when:

  1. Trying to establish edge channels
  2. Attempting to create terminators for hosted services
  3. Trying to connect to services (dial fabric)

The errors persist over hours with hundreds of occurrences, suggesting the client is not successfully re-authenticating.

Detailed Error Analysis

Error Type Identification

The specific error is: "no api session found for token"

This indicates the controller cannot find the API session token in its database. This is not:

  • A malformed token error
  • An expired token signature error
  • An unrecognized signer error

The token format is valid, but the session simply doesn't exist in the controller's database (expected after restart).

Router Logs

When the client attempts to establish an edge channel, the router receives the old API session token and queries the controller:

{
  "error": "no api session found for token [<UUID>], fingerprint: [<fingerprint>], subjects [[CN=<username>,O=OpenZiti...]]",
  "file": "github.com/openziti/ziti/router/xgress_edge/accept.go:280",
  "func": "github.com/openziti/ziti/router/xgress_edge.(*Acceptor).handleUngroupedUnderlay",
  "level": "error",
  "msg": "failure accepting edge channel u{classic}->i{ziti-sdk-c[2]@MacBook-Air/xxxx} with underlay",
  "time": "2026-01-08T08:02:25.927Z"
}
{
  "error": "no api session found for token [<UUID>], fingerprint: [<fingerprint>], subjects [[CN=<username>,O=OpenZiti...]]",
  "file": "github.com/openziti/channel/v4@v4.2.50/multi_listener.go:44",
  "func": "github.com/openziti/channel/v4.(*MultiListener).AcceptUnderlay",
  "isGrouped": false,
  "level": "error",
  "msg": "failed to create channel",
  "time": "2026-01-08T08:02:25.927Z"
}

Connection attempts to services fail with the same root cause:

{
  "_context": "ch{edge}->u{classic}->i{ziti-sdk-c[0]@MacBook-Air/xxxx}",
  "chSeq": 61,
  "connId": 20,
  "edgeSeq": 0,
  "error": "invalid session",
  "file": "github.com/openziti/ziti/router/xgress_edge/listener.go:1378",
  "func": "github.com/openziti/ziti/router/xgress_edge.(*nonXgConnectHandler).FinishConnect",
  "level": "warning",
  "msg": "failed to dial fabric",
  "time": "2026-01-07T20:52:43.552Z",
  "type": "EdgeConnectType"
}

Multiple connection attempts observed (connId: 20, 21, 22, 23, 24, 25...) all failing with the same error.

Controller Logs

The controller fails to load the session from its database:

{
  "error": "invalid session",
  "file": "github.com/openziti/ziti/controller/handler_edge_ctrl/common.go:334",
  "func": "github.com/openziti/ziti/controller/handler_edge_ctrl.(*baseSessionRequestContext).loadFromBolt",
  "level": "error",
  "msg": "invalid session",
  "operation": "create.terminator",
  "time": "2026-01-07T20:49:53.167Z"
}

The error is returned to the router during terminator creation:

{
  "_context": "ch{<routerId>}->u{classic}->i{<routerId>/xxxx}",
  "error": "invalid session",
  "file": "github.com/openziti/ziti/controller/handler_edge_ctrl/create_terminator_v2.go:212",
  "func": "github.com/openziti/ziti/controller/handler_edge_ctrl.(*createTerminatorV2Handler).returnError",
  "level": "error",
  "msg": "responded with error",
  "routerId": "<routerId>",
  "terminatorId": "<terminatorId>",
  "time": "2026-01-07T20:49:53.167Z"
}

Router Response

The router removes the terminator due to the controller error:

{
  "file": "github.com/openziti/ziti/router/xgress_edge/hosted.go:473",
  "func": "github.com/openziti/ziti/router/xgress_edge.(*hostedServiceRegistry).Remove",
  "level": "info",
  "msg": "terminator removed from router set",
  "reason": "received error from controller: invalid session",
  "terminatorId": "<terminatorId>",
  "time": "2026-01-07T20:49:53.169Z"
}

This pattern repeats for dozens of different terminators over several hours.

Timeline

  1. Server restart - Controller and router containers restarted
  2. Client disconnect - All client connections lost
  3. Client reconnect attempts - Clients attempt to reconnect using cached/old API session tokens
  4. Session validation fails - Controller cannot find sessions (expected - they were cleared on restart)
  5. Errors persist - Same "invalid session" errors continue for hours with no successful recovery

Frequency & Impact

  • Hundreds of errors observed over a 2-3 hour period
  • Multiple terminators affected (50+ unique terminator IDs in logs)
  • Multiple connection attempts failing repeatedly
  • No successful service connectivity achieved

Technical Flow

The error originates from this sequence:

  1. Client (ziti-sdk-c) attempts to establish edge channel with cached API session token
  2. Router receives connection attempt and forwards token to controller for validation
  3. Controller calls baseSessionRequestContext.loadFromBolt() at common.go:334
  4. BoltDB lookup fails - session does not exist (cleared during restart)
  5. Controller returns "invalid session" error to router
  6. Router rejects connection/terminator with generic "invalid session" message
  7. Client receives "invalid session" error
  8. Client repeats same behavior - continues using old token instead of re-authenticating
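
For reference, the quickest way to confirm on the controller side that the old API sessions really are gone after a restart is something like this (a sketch assuming the standard `ziti` CLI with an admin login; host and credentials are placeholders):

```bash
# Log in to the controller's management API (placeholder host/credentials).
ziti edge login controller.test.ziti:443 -u admin -p '<redacted>'

# List the API sessions the controller currently knows about. After the restart,
# the client's old API session token no longer shows up here, which matches the
# "no api session found for token" errors in the router logs.
ziti edge list api-sessions
```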

Questions for OpenZiti Team

  1. Expected Client Behavior: After receiving "invalid session" errors, should the ziti-sdk-c automatically trigger re-authentication? Or does the application need to handle this?

  2. Error Specificity: Would it be helpful to return a more specific error code (e.g., "SESSION_NOT_FOUND" vs "SESSION_EXPIRED" vs "SESSION_INVALID") so clients can make better recovery decisions?

  3. Client SDK Issue: Is this a known issue with ziti-sdk-c where it doesn't properly handle session invalidation after network interruption?

  4. Session Persistence: Are API sessions expected to survive a controller restart, or is this behavior (sessions cleared on restart) correct?

  5. Recovery Mechanism: What is the recommended recovery mechanism for clients in this scenario? Should they:

    • Detect "invalid session" and automatically re-authenticate?
    • Have a timeout that triggers re-authentication?
    • Rely on application-level error handling?

Current Workaround

The only way to restore connectivity is to manually restart the client application to force re-authentication.

Additional Information

I can provide:

  • Full router logs with all invalid session errors
  • Full controller logs for the time period
  • Client-side logs if available
  • Specific configuration files (with sensitive data redacted)

Please let me know what additional debugging information would be helpful!

OpenZiti 1.8.0-pre5 Control Channel Stuck in "reconnecting" State - Regression Bug

Critical Issue Summary

After upgrading from 1.6.12 to 1.8.0-pre5, router-controller control channel is permanently stuck in "reconnecting" state despite correct network configuration, causing all circuit creation to fail with timeouts.

This worked perfectly in 1.6.12 - this is a critical regression bug in 1.8.0-pre5.

Environment

  • Previous Version: 1.6.12 (working correctly)
  • Current Version: 1.8.0-pre5 (broken)
  • Controller: openziti/ziti-controller:1.8.0-pre5
  • Router: openziti/ziti-router:1.8.0-pre5
  • Client SDK: ziti-sdk-c
  • Deployment: Azure VMs, Docker containers
  • Architecture: Single controller + single router on same Docker network
  • Platforms Affected: macOS and Windows clients

The Bug: Control Channel Stuck in "reconnecting"

Router logs show control channel permanently in "reconnecting" state:

{
  "context": "ch{ctrl}->u{reconnecting}->i{NetFoundry Inc. Client s3n-fHzTu/nB58}",
  "error": "error creating route for [c/XXX]: timeout waiting for message reply: context deadline exceeded",
  "file": "github.com/openziti/ziti/router/handler_ctrl/route.go:139",
  "msg": "failure while handling route update"
}

Key observation: Every single route request shows u{reconnecting} in context, never showing a stable connection state.

Frequency: 14+ occurrences per hour, affecting every circuit creation attempt.

Impact: Mass Circuit Creation Failures

217 circuit creation failures in 6 hours - 100% failure rate:

Controller Logs

{
  "_channels": ["selectPath"],
  "apiSessionId": "c6e8c16a-9189-498a-af23-7b81032618db",
  "attemptNumber": 2,
  "circuitId": "5Q9Vb5lZA2u6J7L29W2mos",
  "file": "github.com/openziti/ziti/controller/network/network.go:674",
  "level": "warning",
  "msg": "circuit creation failed after [2] attempts, sending cleanup unroutes",
  "serviceId": "4Ars44EiIHcFOO8t7GVAgQ",
  "serviceName": "Elitecom-Openziti-Web",
  "sessionId": "cmk6ngwhk2te001pc4olkcz6f"
}
{
  "attemptNumber": 1,
  "circuitId": "1VBo17d5yR88FkQsi8PgzU",
  "file": "github.com/openziti/ziti/controller/network/routesender.go:197",
  "level": "warning",
  "msg": "received failed route status from [r/8SGqhnW74C] for attempt [#0]",
  "error": "(error creating route for [c/XXX]: timeout waiting for message reply: context deadline exceeded)"
}

Router Logs

{
  "_channels": ["establishPath"],
  "apiSessionId": "c6e8c16a-9189-498a-af23-7b81032618db",
  "attempt": 0,
  "attemptNumber": "1",
  "binding": "edge",
  "circuitId": "1VBo17d5yR88FkQsi8PgzU",
  "context": "ch{ctrl}->u{reconnecting}->i{NetFoundry Inc. Client s3n-fHzTu/nB58}",
  "destination": "5gKN8j9cBnjjNS5ICeC2IM",
  "error": "error creating route for [c/XXX]: timeout waiting for message reply: context deadline exceeded",
  "file": "github.com/openziti/ziti/router/handler_ctrl/route.go:139",
  "level": "error",
  "msg": "failure while handling route update"
}

Terminator Cleanup

{
  "file": "github.com/openziti/ziti/router/xgress_edge/hosted.go:473",
  "level": "info",
  "msg": "terminator removed from router set",
  "reason": "received error from controller: timeout waiting for message reply: context deadline exceeded",
  "terminatorId": "<id>"
}

Client-Side Symptoms

ERROR connect_reply_cb() conn<service> failed to connect, reason=invalid session
DEBUG complete_conn_req() conn<service> Disconnected failed: connection is closed

Configuration Details

Router Configuration:

ctrl:
  endpoint: tls:openziti.itglobal.dk:443

Docker Network:

  • ✅ Router and controller on same Docker network (ziti-controller_ziti)
  • ✅ Internal connectivity verified
  • ✅ No network isolation issues

Session Configuration:

api_session_enforcer:
  frequency: 5s
  sessionTimeout: 30m0s

Router Cache:

listeners:
  - binding: edge
    options:
      getSessionTimeout: 60

Database: Successfully migrated to version 44

Resources:

  • Router: 5.43% CPU, 97MB RAM
  • Controller: 19.31% CPU, 139MB RAM

Timing Pattern (Client Perspective)

Clients observe:

  • Quick reconnect (< 60 sec): ✅ Sometimes works (hits timing window)
  • Delayed reconnect (> 60 sec): ❌ Always fails

This suggests the router's API session cache (60 s timeout) occasionally masks the underlying control channel issue, but once the cache expires, the control channel problem becomes visible.
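
To make the quick-vs-delayed comparison reproducible, a simple probe loop from the client works (a rough sketch; the hostname/port are placeholders for one of the affected HTTPS services):

```bash
# Probe an intercepted service every 5 seconds and log the result with a timestamp.
# Based on the pattern above, the probe should succeed shortly after the tunneler
# reconnects and start failing once the ~60 s cache window has passed.
while true; do
  ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
  if nc -z -w 2 webadmin.core.test.ziti 443; then
    echo "$ts connect OK"
  else
    echo "$ts connect FAILED"
  fi
  sleep 5
done
```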

What We've Verified

✅ Network connectivity: Router and controller on same Docker network
✅ TLS certificates: Valid, no handshake failures
✅ Resources: CPU and memory normal
✅ Client tokens: Fresh tokens after restart
✅ Database: Migration successful to v44
✅ Configuration: Controller endpoint correct
✅ No restarts: Controller and router are stable, no crashes

The Regression

Version 1.6.12 Behavior (Working)

  • Control channel stable
  • Circuit creation: 100% success rate
  • Clients reconnect automatically after server restart
  • No "reconnecting" state observed
  • No timeout errors

Version 1.8.0-pre5 Behavior (Broken)

  • Control channel permanently in "reconnecting" state
  • Circuit creation: 0% success rate
  • Clients cannot establish connections
  • Persistent "timeout waiting for message reply" errors
  • Requires manual intervention (client restart)

Critical Questions for OpenZiti Team

1. Control Channel State Machine Bug

Question: Why is the control channel stuck in "reconnecting" state when network connectivity is verified and stable?

Code locations:

  • github.com/openziti/ziti/router/handler_ctrl/route.go:139 - where timeouts occur
  • Control channel context shows: u{reconnecting} instead of connected state

Hypothesis: A bug in the control channel reconnection logic in 1.8.0-pre5 prevents successful connection establishment or keeps the state machine stuck in a reconnecting loop.

2. Message Reply Timeout

Question: What is the expected timeout for "waiting for message reply" during route creation?

Observation: Every single route request times out with "context deadline exceeded"

Code location: github.com/openziti/ziti/controller/network/routesender.go:197

3. Changes Between Versions

Question: What changed in control channel management between 1.6.12 and 1.8.0-pre5?

Specific areas of interest:

  • Control channel connection/reconnection logic
  • Message timeout handling
  • Route creation protocol
  • Circuit establishment flow

4. Known Issues

Question: Is this a known issue in 1.8.0-pre5? Are there any:

  • Patches available (pre6, pre7, etc.)?
  • Configuration workarounds?
  • Debug logging we should enable?

5. Diagnostic Data Needed

Question: What additional information would help diagnose this?

We can provide:

  • Full router logs with circuit failures
  • Full controller logs
  • Router/controller debug level logs
  • Network packet captures
  • Any other diagnostics requested

Request

This appears to be a critical regression bug in 1.8.0-pre5's control channel implementation.


Additional Context: We have TWO separate OpenZiti deployments (different Azure VMs), both experiencing identical symptoms after upgrading to 1.8.0-pre5. Both worked flawlessly on 1.6.12. This confirms it's not an environment-specific configuration issue but a regression in 1.8.0-pre5.

There's definitely something unusual happening in your setup.

I don't think it's the control channel, though. The reason you're seeing 'reconnecting' in the log messages is because that's the type of channel implementation. The router -> controller channel uses a 'reconnecting' implementation. It automatically reconnects after disconnect and encapsulates that functionality so other router code doesn't need to worry about it.

If the router is actually reconnecting, you'll see router messages that contain the phrases `starting reconnection process` and `reconnected`.
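
For example, something along these lines will show whether the control channel is actually dropping and reconnecting (adjust the container name to match your deployment):

```bash
# Look for real control-channel reconnection activity in the router logs.
# If the channel is stable, this should return no matches.
docker logs ziti-router 2>&1 | grep -E "starting reconnection process|reconnected"
```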

The question is why are route responses from the router to the controller timing out?
Would you be willing to grab some stack dumps so we can see the system state when this is happening?

You can run `ziti fabric inspect stackdump -f` to grab stack dumps from all controllers/routers connected to the controller you run the command against. The `-f` flag will cause the stack dumps for each process to be placed in a separate file.

Alternatively, you can run `ziti agent stack > output.file` on the controller/router nodes directly. There's a write-up here: How To Gather OpenZiti Diagnostics · openziti/ziti Wiki · GitHub
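
For example:

```bash
# From a machine logged in to the controller's management API: grab stack dumps
# from the controller and every connected router, one file per process.
ziti fabric inspect stackdump -f

# Or, directly on a controller/router node, dump that process's stacks to a file.
ziti agent stack > controller-stackdump-$(date -u +%Y%m%d-%H%M%S).txt
```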

My rough guess is that the controller is getting locked up and is unable to process messages coming in from the routers. If that's the case, the controller stack dumps should show us that and what's causing it.

I've been running regression tests in preparation for the 1.8.0 release and my data flow tests haven't hit anything like what you're describing, so hopefully we can figure out what's different about your setup and get it fixed before doing the final release.

Thank you,
Paul

Hi @plorenz,

Thank you for the clarification about the "reconnecting" channel implementation - that makes sense now! I was misinterpreting the context field.

Stack Dumps Collected

I've collected stack dumps as requested:

**Files:**

- `controller-stackdump-20260109-183851.txt` (32KB)

- `router-stackdump-20260109-183854.txt` (129KB)

Both collected using `ziti agent stack` while the issue was occurring (circuit creation failures happening).

## Additional Context About "Reconnection"

You're right - I checked the router logs more carefully and I do **NOT** see:

- "starting reconnection process"

- "reconnected"

So the control channel itself is stable. The "reconnecting" in the context was indeed just the implementation type.

## The Real Question

As you said: **Why are route responses from the router to the controller timing out?**

### What We're Seeing

Every route creation attempt times out:

**Controller side:**

```
"msg": "received failed route status from [r/8SGqhnW74C] for attempt [#0]",
"error": "error creating route for [c/XXX]: timeout waiting for message reply: context deadline exceeded"
```

**Router side:**

```
"error": "error creating route for [c/XXX]: timeout waiting for message reply: context deadline exceeded",
"file": "github.com/openziti/ziti/router/handler_ctrl/route.go:139"
```

### Frequency & Pattern

- **217 circuit creation failures in 6 hours**

- **100% failure rate** - not a single successful circuit creation

- All for the same service (sj-Openziti-Web, service ID: 4Ars44EiIHcFOO8t7GVAgQ)

- Router ID: 8SGqhnW74C

### Your Hypothesis

> My rough guess is that the controller is getting locked up and is unable to process messages coming in from the routers.

This would definitely explain:

- Why **every** route request times out (not intermittent)

- Why it affects all circuit creation attempts

- Why clients see "invalid session" (secondary effect of failed circuits)

## Environment Details That Might Be Relevant

### Setup

- Single controller + single router on same Docker network

- Azure VMs

- Controller and router have been running stable for ~2677 minutes (router) / ~2689 minutes (controller) without restart

### What Changed

- Upgraded from 1.6.12 (working perfectly) to 1.8.0-pre5 (this issue appeared)

- **Two separate deployments** (different Azure VMs) both experiencing identical symptoms

### Configuration

- Database migrated successfully to version 44

- Session timeout: 30m

- Router cache: 60s (`getSessionTimeout`)

- Controller endpoint: `tls:openziti.domain.dk:443`

## Questions

1. **Stack dumps**: Do the stack dumps show anything interesting? Any goroutines blocked or waiting on locks?

2. **Debug logging**: Should I enable debug-level logging on the controller/router to capture more detail during circuit creation?

3. **Database**: Could there be something with database version 44 that's causing lock contention? The migration completed successfully but maybe there's a performance issue?

4. **Timeouts**: What's the expected timeout for route response messages? Is there a configuration we can adjust to see if that helps?

## How to Reproduce in Your Tests?

Since you're not hitting this in regression tests, here's what might be different:

- **Service type**: Hosted service (terminator hosted on the router)

- **Service binding**: Edge

- **Multiple concurrent connection attempts**: HTTPS with multiple parallel connections (see the sketch after this list)

- **Database**: Migrated from 1.6.12 schema to v44 (not fresh install)

- **Deployment**: Docker containers with Let's Encrypt certificates mounted
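
A rough way to generate the parallel HTTPS load in case it helps with reproduction (the URL is a placeholder for one of our intercepted services):

```bash
# Fire 10 parallel HTTPS requests through the tunneler at an intercepted service.
# -k skips certificate verification in case the service cert isn't trusted
# from the machine running the test.
for i in $(seq 1 10); do
  curl -sk -o /dev/null -w "%{http_code}\n" https://webadmin.core.test.ziti/ &
done
wait
```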

## Next Steps

I've attached the stack dumps. Let me know if you need:

- Additional diagnostics

- Specific debug logging enabled

- Different timing for stack dump collection (during vs between failures)

- Controller/router metrics

- Network traces

Happy to help debug this further - especially if it helps catch something before 1.8.0 final release!

router-stackdump-redacted.txt (128.6 KB)

controller-stackdump-redacted.txt (31.9 KB)

Hello, thank you for getting and sending over the stack dumps. Unfortunately, they don't show any routing activity.

This should manifest in the controller stack dump as one or more calls to `Network.CreateCircuit` - ziti/controller/network/network.go at v1.8.0-pre5 · openziti/ziti · GitHub

On the router side, we should see the `routeHandler` processing route messages: ziti/router/handler_ctrl/route.go at v1.8.0-pre5 · openziti/ziti · GitHub

I do see 11 active circuits on the router, based on the number of buffer routines. It can be tricky to catch these in progress, since the timeouts are usually 5s. Often when I'm trying to catch these kinds of things, I'll set the stack dumps to be captured in a loop, every couple of seconds, then trigger the behavior. Then I'll scan the stack dumps for the methods that would need to be running and discard anything not relevant.
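
Something like this, for example (run on the router and/or controller node; the filename pattern is just a suggestion):

```bash
# Capture a stack dump every 2 seconds for a minute while reproducing the issue.
for i in $(seq 1 30); do
  ziti agent stack > "stack-$(date -u +%H%M%S).txt"
  sleep 2
done

# Then keep only the captures that actually show route handling in progress
# (assuming the routeHandler type name shows up in the goroutine frames).
grep -l "routeHandler" stack-*.txt
```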

The other thing I notice is that the circuits look like they're single hop (ingress/egress on the same router). That's a use case which doesn't get tested as much, so that's a potential lead to follow.

Let me know if you can grab some additional stacks.

Thank you,
Paul

Thanks Paul 🙂

I captured stack dumps every 2 seconds for 60 seconds (30 captures total) while triggering client connection attempts.

controller-6-redacted.txt (82.0 KB)

router-9.txt (126.3 KB)

The same problem: Ziti edge tunnel: dial connections refused after ~5 minutes usage

I think there's overlap. The invalid session problem seems likely to be a c-sdk issue, as it's not refreshing sessions after they become invalid. The c-sdk folks are digging into that issue.

I'm digging into this because of the server-side timeouts, which seem suspicious.

Paul

Leaving this response here, but there's a more relevant follow-up below

Hi @msbusk, I looked at the stacks you sent but didn't find anything unusual. The controller side was in CreateCircuit, but it was in the middle of checking whether the session was valid, which happens before route calculation and route establishment.

Some thoughts:

  1. Were you still seeing timeouts in the router logs in those 60 seconds? If yes, there should be at least some stacks from the routers where it's trying to send route responses.

  2. You can turn on debug-level messages in the router; the relevant code looks like this:

	log.Debug("sending success response")
	if err := response.WithTimeout(rh.env.GetNetworkControllers().DefaultRequestTimeout()).Send(rh.ch); err == nil {
		log.Debug("handled route")
	} else {
		log.WithError(err).Error("send response failed")
	}

So you'll see the `sending success response` message followed by one of the two follow-up messages (`handled route` or `send response failed`); see the grep sketch after this list.

  3. Do you mind looking through the rest of the stacks you captured for other cases where it's in CreateCircuit, and looking for time-correlated router stacks where it's in routeHandler? Or you can send them all to me in a bundle and I can look through them, as long as there's an indication of when they were captured for correlation.

  4. Based on what's in the stacks, you're using legacy sessions. Would you be willing to try enabling OIDC sessions and see how that affects the tests?
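
For the debug-logging suggestion above, once it's enabled, something like this will pull out the relevant route-handling lines (adjust the container name to your deployment):

```bash
# Show the route-handling debug/error messages referenced in the snippet above.
docker logs ziti-router 2>&1 | grep -E "sending success response|handled route|send response failed"
```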

Thank you,
Paul

@msbusk I took another look through your previous messages to see if I'd missed anything, and I had.

The timeout message you got was from line 139 of route.go, which is here:

func (rh *routeHandler) fail(msg *channel.Message, attempt int, route *ctrl_pb.Route, err error, errorHeader byte, log *logrus.Entry) {
	log.WithError(err).Error("failure while handling route update")

That means the timeout was from when the router was sending a dial message to the hosting SDK, which means there's likely no lockup in the controller or router, just the SDK not responding to the dial request.

Would it be possible for you to try removing the c-sdk from the picture, so we can narrow down the problem scope? Since this is on the hosting side, you could either use the ER/T or `ziti tunnel` to host. If you're willing to try it and need some help with configuration, let me know and I can try to walk you through it.
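
As a very rough sketch of the `ziti tunnel` option (the identity file name is hypothetical, and it's worth double-checking the exact run mode and flags with `ziti tunnel --help`):

```bash
# Run the Go tunneler in hosting-only mode with a separately enrolled identity,
# so the c-sdk-based application is no longer the one terminating the service.
ziti tunnel host --identity hosting-identity.json
```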

Thank you,
Paul