Intermittent Disconnection Issue – Ziti Controller & Public Router in GKE (v1.15)

I know of no other reports of sporadic connectivity issues with any version of the overlay. So I still suspect this is related either to the network between nodes/edge devices (the underlay internet/network overall) or to the spot VMs. As long as the links re-form and stability returns (that is, the overlay recovers), I don't know if there's anything we can do to fix this.

Do you need me to upgrade and check?

Definitely. If you can provide very clear steps to reproduce, we would be most appreciative. We run thousands of routers all day long at NetFoundry and I've never heard any customers complain about this myself. So while I only have anecdotal evidence, it seems pretty strong. :frowning: I think it's just the underlay or the VM itself somehow and not related to OpenZiti. These sorts of issues are exceptionally difficult to diagnose though, and it's even harder for us to remotely support that sort of situation via the forums. I dunno how much we can actually help.

To recap, you've moved from spot VMs to always-on VMs for the consistent availability of your nodes to address the above symptom, which was presumably caused by the controller becoming unavailable when the spot VM was paused.

You also fixed a storage pressure issue that may have caused at least one of your routers to be unavailable.

After moving to always-on VMs in GKE and fixing the full-disk problem in EC2, you observed a different situation in a tunneler log saying at least one edge router had become unavailable, correct?

This is not the same problem you addressed by solving the full-disk issue, correct?

You asked if the router's configuration or resources must be tuned to address this condition. First, did the tunneler continue to function normally at the time of the "router unavailable" log messages you shared? If not, what was the symptom that led you to investigate the logs?

I assume then the router that became unavailable was the only authorized router at that time. Do you continue to have the same problem when at least two routers are authorized concurrently?

[
  {
    "circuitCount": 9,
    "file": "github.com/openziti/ziti/controller/handler_ctrl/circuit_confirmation.go:47",
    "func": "github.com/openziti/ziti/controller/handler_ctrl.(*circuitConfirmationHandler).HandleReceive",
    "level": "info",
    "msg": "received circuit confirmation request",
    "routerId": "PT9mKNWot",
    "time": "2025-04-02T12:47:43.122Z"
  },
  {
    "file": "github.com/openziti/ziti/controller/network/fault.go:32",
    "func": "github.com/openziti/ziti/controller/network.(*Network).fault",
    "level": "info",
    "msg": "network fault processing for [314] circuits",
    "time": "2025-04-02T12:47:46.260Z"
  },
  {
    "circuitCount": 9,
    "file": "github.com/openziti/ziti/controller/handler_ctrl/circuit_confirmation.go:47",
    "func": "github.com/openziti/ziti/controller/handler_ctrl.(*circuitConfirmationHandler).HandleReceive",
    "level": "info",
    "msg": "received circuit confirmation request",
    "routerId": "p3-taAC8gI",
    "time": "2025-04-02T12:47:51.834Z"
  },
  {
    "_context": "ch{sTp8vPU8g}->u{classic}->i{aLrl}",
    "error": "service 26ZlMZPqTpxdFSOwcXBax6 has no terminators",
    "file": "github.com/openziti/ziti/controller/handler_edge_ctrl/common.go:79",
    "func": "github.com/openziti/ziti/controller/handler_edge_ctrl.(*baseRequestHandler).returnError",
    "level": "error",
    "msg": "responded with error",
    "operation": "create.circuit",
    "routerId": "sTp8vPU8g",
    "time": "2025-04-02T12:47:55.843Z",
    "token": "bcdad952-4b2d-46d2-bf08-768e30f51f85"
  },
  {
    "file": "github.com/openziti/ziti/controller/network/fault.go:32",
    "func": "github.com/openziti/ziti/controller/network.(*Network).fault",
    "level": "info",
    "msg": "network fault processing for [314] circuits",
    "time": "2025-04-02T12:48:01.261Z"
  },
  {
    "circuitCount": 109,
    "file": "github.com/openziti/ziti/controller/handler_ctrl/circuit_confirmation.go:47",
    "func": "github.com/openziti/ziti/controller/handler_ctrl.(*circuitConfirmationHandler).HandleReceive",
    "level": "info",
    "msg": "received circuit confirmation request",
    "routerId": "sTp8vPU8g",
    "time": "2025-04-02T12:48:04.818Z"
  },
  {
    "file": "github.com/openziti/ziti/controller/network/fault.go:32",
    "func": "github.com/openziti/ziti/controller/network.(*Network).fault",
    "level": "info",
    "msg": "network fault processing for [314] circuits",
    "time": "2025-04-02T12:48:16.262Z"
  }
]
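If it helps when sifting through a longer capture, here is a minimal Python sketch that counts entries per log level and pulls out the error entries with their router and timestamp. It assumes the controller log has been collected into a JSON array like the excerpt above; the `summarize` helper and the trimmed `SAMPLE` are my own illustration, not part of any OpenZiti tooling.

```python
import json
from collections import Counter

# Two entries taken from the excerpt above, trimmed to the fields used here.
SAMPLE = """
[
  {"level": "info", "msg": "received circuit confirmation request",
   "routerId": "PT9mKNWot", "time": "2025-04-02T12:47:43.122Z"},
  {"level": "error", "msg": "responded with error",
   "error": "service 26ZlMZPqTpxdFSOwcXBax6 has no terminators",
   "operation": "create.circuit", "routerId": "sTp8vPU8g",
   "time": "2025-04-02T12:47:55.843Z"}
]
"""


def summarize(entries):
    """Return a per-level count and a list of (time, routerId, error) tuples."""
    levels = Counter(entry.get("level", "unknown") for entry in entries)
    errors = [
        (entry.get("time"), entry.get("routerId"), entry.get("error"))
        for entry in entries
        if entry.get("level") == "error"
    ]
    return levels, errors


levels, errors = summarize(json.loads(SAMPLE))
print(dict(levels))
for when, router, message in errors:
    print(when, router, message)
```

Lining up the error timestamps with the times the tunneler reported a router as unavailable may show whether the two events coincide.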

I don't suspect the versions are a factor, no, seconding @TheLumberjack.


In summary, I'd focus on monitoring your routers' availability and ensuring there are always at least two authorized routers for each identity+service combination.
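As a rough illustration of the "at least two routers" check, here is a small Python sketch. It assumes you have already assembled, from your own policy inventory, a mapping of each identity+service pair to the routers authorized for it; the mapping, the names in it, and the `under_provisioned` helper are all hypothetical and do not come from the Ziti CLI or API.

```python
# Hypothetical inventory: (identity, service) -> routers authorized for that pair.
# In practice this would be built from your own export of identities, services,
# and router policies; none of these names are real.
ACCESS = {
    ("tunneler-1", "service-a"): {"router-a", "router-b"},
    ("tunneler-1", "service-b"): {"router-a"},
    ("tunneler-2", "service-a"): set(),
}


def under_provisioned(access, minimum=2):
    """Return the pairs that have fewer than `minimum` authorized routers."""
    return sorted(
        pair for pair, routers in access.items() if len(routers) < minimum
    )


for identity, service in under_provisioned(ACCESS):
    count = len(ACCESS[(identity, service)])
    print(f"{identity} + {service}: only {count} authorized router(s)")
```

Any pair this flags would depend on a single router, so one router outage would be enough to interrupt that identity's access to the service.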