To recap, you moved from spot VMs to always-on VMs so that your nodes are consistently available. That addressed the above symptom, which was presumably caused by the controller becoming unavailable when the spot VM was paused.
You also fixed a storage pressure issue that may have caused at least one of your routers to be unavailable.
After moving to always-on VMs in GKE and fixing the full-disk problem in EC2, you saw a different condition: a tunneler log reporting that at least one edge router had become unavailable, correct?
This is not the same problem you addressed by solving the full-disk issue, correct?
You asked whether the router's configuration or resources need to be tuned to address this condition. First, did the tunneler continue to function normally at the time of the "router unavailable" log messages you shared? If not, what was the symptom that led you to investigate the logs?
I assume, then, that the router that became unavailable was the only authorized router at that time. Do you still have the same problem when at least two routers are authorized concurrently?
[
{
"circuitCount": 9,
"file": "github.com/openziti/ziti/controller/handler_ctrl/circuit_confirmation.go:47",
"func": "github.com/openziti/ziti/controller/handler_ctrl.(*circuitConfirmationHandler).HandleReceive",
"level": "info",
"msg": "received circuit confirmation request",
"routerId": "PT9mKNWot",
"time": "2025-04-02T12:47:43.122Z"
},
{
"file": "github.com/openziti/ziti/controller/network/fault.go:32",
"func": "github.com/openziti/ziti/controller/network.(*Network).fault",
"level": "info",
"msg": "network fault processing for [314] circuits",
"time": "2025-04-02T12:47:46.260Z"
},
{
"circuitCount": 9,
"file": "github.com/openziti/ziti/controller/handler_ctrl/circuit_confirmation.go:47",
"func": "github.com/openziti/ziti/controller/handler_ctrl.(*circuitConfirmationHandler).HandleReceive",
"level": "info",
"msg": "received circuit confirmation request",
"routerId": "p3-taAC8gI",
"time": "2025-04-02T12:47:51.834Z"
},
{
"_context": "ch{sTp8vPU8g}->u{classic}->i{aLrl}",
"error": "service 26ZlMZPqTpxdFSOwcXBax6 has no terminators",
"file": "github.com/openziti/ziti/controller/handler_edge_ctrl/common.go:79",
"func": "github.com/openziti/ziti/controller/handler_edge_ctrl.(*baseRequestHandler).returnError",
"level": "error",
"msg": "responded with error",
"operation": "create.circuit",
"routerId": "sTp8vPU8g",
"time": "2025-04-02T12:47:55.843Z",
"token": "bcdad952-4b2d-46d2-bf08-768e30f51f85"
},
{
"file": "github.com/openziti/ziti/controller/network/fault.go:32",
"func": "github.com/openziti/ziti/controller/network.(*Network).fault",
"level": "info",
"msg": "network fault processing for [314] circuits",
"time": "2025-04-02T12:48:01.261Z"
},
{
"circuitCount": 109,
"file": "github.com/openziti/ziti/controller/handler_ctrl/circuit_confirmation.go:47",
"func": "github.com/openziti/ziti/controller/handler_ctrl.(*circuitConfirmationHandler).HandleReceive",
"level": "info",
"msg": "received circuit confirmation request",
"routerId": "sTp8vPU8g",
"time": "2025-04-02T12:48:04.818Z"
},
{
"file": "github.com/openziti/ziti/controller/network/fault.go:32",
"func": "github.com/openziti/ziti/controller/network.(*Network).fault",
"level": "info",
"msg": "network fault processing for [314] circuits",
"time": "2025-04-02T12:48:16.262Z"
}
]
I don't suspect the versions are a factor, no, seconding @TheLumberjack.
In summary, I'd focus on monitoring your routers' availability and ensuring there are always at least two authorized routers for each identity+service combination.
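If it helps with the monitoring side, here's a rough sketch of how you might scan a controller log for the entries that matter. It assumes the log is available as a JSON array shaped like the excerpt you shared. `summarize` is a hypothetical helper I made up for illustration, not part of OpenZiti, and the sample holds three entries trimmed from your log.

```python
import json
from collections import Counter

CONFIRMATION_MSG = "received circuit confirmation request"


def summarize(entries):
    """Return (error entries, circuit-confirmation count per router ID).

    Hypothetical helper: assumes each entry is a dict with the same keys
    as the controller log excerpt ("level", "msg", "routerId", ...).
    """
    errors = [e for e in entries if e.get("level") == "error"]
    confirmations = Counter(
        e["routerId"]
        for e in entries
        if e.get("msg") == CONFIRMATION_MSG and "routerId" in e
    )
    return errors, confirmations


# Three entries trimmed from the log excerpt in this thread.
sample = json.loads("""
[
  {"level": "info", "msg": "received circuit confirmation request",
   "routerId": "PT9mKNWot", "time": "2025-04-02T12:47:43.122Z"},
  {"level": "error", "msg": "responded with error",
   "error": "service 26ZlMZPqTpxdFSOwcXBax6 has no terminators",
   "operation": "create.circuit", "routerId": "sTp8vPU8g",
   "time": "2025-04-02T12:47:55.843Z"},
  {"level": "info", "msg": "received circuit confirmation request",
   "routerId": "sTp8vPU8g", "time": "2025-04-02T12:48:04.818Z"}
]
""")

errors, confirmations = summarize(sample)

# List each error with its time, router, and message.
for e in errors:
    print(e["time"], e.get("routerId"), e.get("error"))

# Show which routers were reporting circuits to the controller.
for router_id, count in confirmations.items():
    print(router_id, count)
```

Run against a longer window of the log, a router that stops showing up in the confirmation counts around the time of the tunneler's "router unavailable" messages would be the one to look at first.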