Issues after going dark + upgrade

I've made my ZAC (ziti-console:8443) and zrok controller (zrok-controller:18080) dark by simply not exposing their ports through my external firewall and creating ziti services with host configs that connect directly to them.
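For reference, the services were created roughly along these lines (a sketch from memory; the names, intercept address, and role attributes below are placeholders, and the zrok service is analogous):

# host.v1 tells the hosting router/tunneler where to offload traffic on the local network
ziti edge create config zac-host host.v1 '{"protocol":"tcp","address":"ziti-console","port":8443}'
# intercept.v1 is the address clients dial over the overlay
ziti edge create config zac-intercept intercept.v1 '{"protocols":["tcp"],"addresses":["zac.ziti"],"portRanges":[{"low":8443,"high":8443}]}'
ziti edge create service zac --configs zac-host,zac-intercept
# bind policy for the identity hosting the service, dial policy for the clients
ziti edge create service-policy zac-bind Bind --service-roles '@zac' --identity-roles '#hosting-routers'
ziti edge create service-policy zac-dial Dial --service-roles '@zac' --identity-roles '#clients'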
This configuration was working fine for a few days, but now that I've updated my ziti containers to the recently released v0.30.4 and restarted everything, I'm having a few issues.

For ZAC I get a "connection reset by peer" error, with the following logs on the controller/router side:

ziti-controller   | [80037.413]   DEBUG ziti/controller/model.(*identityStatusMap).HasEdgeRouterConnection: {now=[2023-10-03 22:13:57.412717914 +0000 UTC m=+63.033280146] identityId=[UVUc5ScxZk] expiresAt=[2023-10-03 22:14:46.922159881 +0000 UTC m=+112.542722033]} reporting identity from active ER conn pool: timedout
ziti-controller   | [80037.413]   DEBUG ziti/controller/model.(*identityStatusMap).HasEdgeRouterConnection: {identityId=[UVUc5ScxZk] expiresAt=[2023-10-03 22:14:46.922159881 +0000 UTC m=+112.542722033] now=[2023-10-03 22:13:57.412949676 +0000 UTC m=+63.033511948]} reporting identity from active ER conn pool: timedout
ziti-controller   | [80037.413]   DEBUG ziti/controller/handler_edge_ctrl.(*baseTunnelRequestContext).ensureSessionForService: {router=[ziti-home-edge-router] routerId=[UVUc5ScxZk] sessionType=[Dial] sessionId=[cln6irmy800gc01ppkh9h9due] operation=[tunnel.create.circuit]} session valid
ziti-controller   | [80037.413]   ERROR ziti/controller/handler_edge_ctrl.(*baseRequestHandler).returnError [ch{UVUc5ScxZk}->u{classic}->i{M6LV}]: {error=[service 4ryLLcOB6TGozmLZEaOufl has no terminators] routerId=[UVUc5ScxZk] operation=[tunnel.create.circuit]} responded with error
ziti-controller   | [80037.488]   DEBUG ziti/controller/model.(*identityStatusMap).HasEdgeRouterConnection: {identityId=[UVUc5ScxZk] expiresAt=[2023-10-03 22:14:46.922159881 +0000 UTC m=+112.542722033] now=[2023-10-03 22:13:57.48825907 +0000 UTC m=+63.108821262]} reporting identity from active ER conn pool: timedout
ziti-controller   | [80037.488]   DEBUG ziti/controller/model.(*identityStatusMap).HasEdgeRouterConnection: {expiresAt=[2023-10-03 22:14:46.922159881 +0000 UTC m=+112.542722033] identityId=[UVUc5ScxZk] now=[2023-10-03 22:13:57.488440391 +0000 UTC m=+63.109002623]} reporting identity from active ER conn pool: timedout
ziti-controller   | [80037.489]   DEBUG ziti/controller/handler_edge_ctrl.(*baseTunnelRequestContext).ensureSessionForService: {sessionType=[Dial] sessionId=[cln6irmy800gc01ppkh9h9due] router=[ziti-home-edge-router] operation=[tunnel.create.circuit] routerId=[UVUc5ScxZk]} session valid
ziti-controller   | [80037.489]   ERROR ziti/controller/handler_edge_ctrl.(*baseRequestHandler).returnError [ch{UVUc5ScxZk}->u{classic}->i{M6LV}]: {error=[service 4ryLLcOB6TGozmLZEaOufl has no terminators] routerId=[UVUc5ScxZk] operation=[tunnel.create.circuit]} responded with error
ziti-controller   | [80038.714]   DEBUG ziti/controller/model.(*identityStatusMap).HasEdgeRouterConnection: {identityId=[UVUc5ScxZk] expiresAt=[2023-10-03 22:14:46.922159881 +0000 UTC m=+112.542722033] now=[2023-10-03 22:13:58.713459069 +0000 UTC m=+64.334021221]} reporting identity from active ER conn pool: timedout
ziti-controller   | [80038.714]   DEBUG ziti/controller/model.(*identityStatusMap).HasEdgeRouterConnection: {identityId=[UVUc5ScxZk] expiresAt=[2023-10-03 22:14:46.922159881 +0000 UTC m=+112.542722033] now=[2023-10-03 22:13:58.71363799 +0000 UTC m=+64.334200222]} reporting identity from active ER conn pool: timedout
ziti-controller   | [80038.714]   DEBUG ziti/controller/handler_edge_ctrl.(*baseTunnelRequestContext).ensureSessionForService: {router=[ziti-home-edge-router] operation=[tunnel.create.circuit] routerId=[UVUc5ScxZk] sessionType=[Dial] sessionId=[cln6irmy800gc01ppkh9h9due]} session valid
ziti-controller   | [80038.714]   ERROR ziti/controller/handler_edge_ctrl.(*baseRequestHandler).returnError [ch{UVUc5ScxZk}->u{classic}->i{M6LV}]: {operation=[tunnel.create.circuit] error=[service 4ryLLcOB6TGozmLZEaOufl has no terminators] routerId=[UVUc5ScxZk]} responded with error
ziti-controller   | [80038.788]   DEBUG ziti/controller/model.(*identityStatusMap).HasEdgeRouterConnection: {identityId=[UVUc5ScxZk] expiresAt=[2023-10-03 22:14:46.922159881 +0000 UTC m=+112.542722033] now=[2023-10-03 22:13:58.7884165 +0000 UTC m=+64.408978692]} reporting identity from active ER conn pool: timedout
ziti-controller   | [80038.789]   DEBUG ziti/controller/model.(*identityStatusMap).HasEdgeRouterConnection: {now=[2023-10-03 22:13:58.788605462 +0000 UTC m=+64.409167654] identityId=[UVUc5ScxZk] expiresAt=[2023-10-03 22:14:46.922159881 +0000 UTC m=+112.542722033]} reporting identity from active ER conn pool: timedout
ziti-controller   | [80038.789]   DEBUG ziti/controller/handler_edge_ctrl.(*baseTunnelRequestContext).ensureSessionForService: {routerId=[UVUc5ScxZk] sessionType=[Dial] sessionId=[cln6irmy800gc01ppkh9h9due] operation=[tunnel.create.circuit] router=[ziti-home-edge-router]} session valid
ziti-controller   | [80038.789]   ERROR ziti/controller/handler_edge_ctrl.(*baseRequestHandler).returnError [ch{UVUc5ScxZk}->u{classic}->i{M6LV}]: {routerId=[UVUc5ScxZk] operation=[tunnel.create.circuit] error=[service 4ryLLcOB6TGozmLZEaOufl has no terminators]} responded with error

For zrok it seems that the TCP 3-way handshake doesn't even complete and I see no logs. The curl connection just "hangs" on the client side.

Restarting the edge router container seems to make ZAC available, but zrok is still down. Here are the logs: https://filebin.net/nrdfoy7pvw4ijcsp/logs.txt

I confirmed that the hostnames defined in the host configs are reachable from the network namespace of the edge router. The service configuration page in ZAC shows this for zrok:
[screenshot: zrok service configuration page in ZAC]
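For the reachability check I mentioned, I did something along these lines from the docker host (a sketch; the container and host names are from my compose setup, and nc may not be present in the router image, in which case bash's /dev/tcp trick works if bash is available):

# probe the offload target from inside the edge router's network namespace
docker exec ziti-edge-router nc -vz zrok-controller 18080
# fallback if nc isn't installed in the image (requires bash in the container)
docker exec ziti-edge-router bash -c '</dev/tcp/zrok-controller/18080 && echo reachable'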

Edit: I rolled back the router to version 0.30.3 and I have the same issue :/.

Any ideas what could be happening here?

In your logs I see:

cannot forward payload, no destination for circuit

I think that's an indicator that the identity forwarding the traffic is not able to connect to the final destination.

I see from the logs successful connections to 172.18.0.6:8443. I assume that's your ZAC?

Since it looks like traffic is making it to the router side, if this were me I would:

  • inspect my host.v1 configs, find the one I'm interested in (I assume zrok) and look at the offload address
  • log onto my router and verify I can probe the port specified in the host.v1 config
  • if I can probe that port, do a tcpdump for that IP and look at what's happening; presumably there will be traffic (rough commands below)
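Roughly what I mean (host names, ports, and the service are placeholders; use whatever your host.v1 actually points at, and -j/--output-json is how I'd look at the raw config data, assuming your CLI version has it):

# from a machine with the ziti CLI: find the host.v1 config and note its offload address/port
ziti edge list configs -j
# also check terminators; per the error above, the zrok service should show none
ziti edge list terminators
# on the router itself: can the offload target be probed?
nc -vz zrok-controller 18080
# if the probe works, watch the traffic while retrying from the client
tcpdump -i any -nn host zrok-controller and port 18080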

Any chance a config change happened or the service is on a new IP/port?

I ended up running the Windows Edge Client and it was able to intercept the connection and connect to the zrok controller. For some reason the tunneling edge router I had set up as a local gateway on my home network wasn't working properly.

I also rebooted the cloud VPS that hosts the controller and another tunneling edge router. The services (ZAC/zrok) again failed to be exposed (no terminators). I then restarted all the containers and it started working.

So here's a question -- if the tunneling edge router cannot connect to the hosted services at the time it comes up, does it just mark them as down and never retry? I think my intermittent failures are a result of the initialization order of the various containers on startup.
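If ordering really is the culprit, one crude workaround I could try would be to force the hosted backends up before the router (a sketch; the service names are from my compose file and will differ for others):

# bring the hosted backends up first, then the ziti containers
docker compose up -d zrok-controller ziti-console
docker compose up -d ziti-controller ziti-edge-router
# (the same ordering could also be expressed with depends_on in the compose file)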

Are you saying the router was reporting something like "service has no terminators" after rebooting the VPS? Yes, that does sound like what the other user reported on that browzer thread. Restarting the router fixes this issue?

"It" being the tunneler? No that doesn't sound like what I would expect. I would expect to try forever over and over but it might be that the tunneler hosting a service tries to bind a few times and if it has no routers, i do think it gives up. If that's the case, it sounds like a router reconnect event should 'kickstart' those back off again. That might be a bug/feature if that's the case. That's probably a situation we just never tested.

Indeed, restarting the router fixed it.

Yeah, in this case it's the tunneling feature of the router (I haven't tried this with ziti-host yet). After rebooting the machine the services weren't reachable and I saw those "has no terminators" messages in the logs (not sure if it was the controller or the router). Restarting the router fixed it.

This is a bug I have hit before, but I have never been able to reliably reproduce it. @plorenz - fyi. I don't know how to trigger it in a "simple" way (so as to make it something that could be debugged).

I can reproduce it. Just start the controller+router first. Try to access the service and you get:

ziti-controller   | [73184.792]   ERROR ziti/controller/handler_edge_ctrl.(*baseRequestHandler).returnError [ch{UVUc5ScxZk}->u{classic}->i{8R9n}]: {error=[service 4ryLLcOB6TGozmLZEaOufl has no terminators] routerId=[UVUc5ScxZk] operation=[tunnel.create.circuit]} responded with error

Then start the service container: it's still inaccessible, and the edge router tunneler does not try to re-establish the connection when you attempt to connect again.

Restart the edge router and it works again.
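In docker compose terms, the sequence is roughly this (service names and the intercept address are placeholders for whatever yours are called):

# 1. start only the ziti side
docker compose up -d ziti-controller ziti-edge-router
curl -v <zrok-intercept-address>:18080    # fails; controller logs "... has no terminators"
# 2. now start the hosted backend
docker compose up -d zrok-controller
curl -v <zrok-intercept-address>:18080    # still fails; the router tunneler never re-binds
# 3. restart the router
docker compose restart ziti-edge-router
curl -v <zrok-intercept-address>:18080    # works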

There's something special about the situation. I can't tell you how many times I've started a controller, started a router, added a bunch of services, and run a sample, and it all worked fine. Then done it all over again and, once in a while, had it fail.

I'll try again without the additional "setup" steps, since that seems like it might be the difference.

Thanks for the tip, I'll see how it goes.

In my case the service was already defined when the router started, so maybe that is the difference?

Oh, just noticed "Race condition in xgress_edge_tunnel tunneller at start but not seen in pre-compiled binary" (Issue #1414, openziti/ziti on GitHub) -- could that also be related?

Yah, we were talking about whether that could possibly be related. It'd also explain (ish) why I rarely see it, since that person also reports that the compiled binary behaves differently than the released binary.

It might be related. It probably will take us/me a bit of testing/finagling to see if I can reliably reproduce the issue. Stay tuned...


If anyone could push a container with the fix I'd be happy to re-test.

That'd be helpful. I'll see if we can update a 'latest'. Which container image are you referring to?

openziti/ziti-router / openziti/ziti-controller

Hi @mvelbaum. Looks like we have :release-next tags which are updated when merges happen to our "next release" branch.

You can change over to using the latest :release-next and give it a try.
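Roughly, that just means pointing the image tags at :release-next and pulling (a sketch; adjust to however your compose file is laid out):

# in docker-compose.yml, switch the image tags, e.g.
#   image: openziti/ziti-controller:release-next
#   image: openziti/ziti-router:release-next
docker compose pull
docker compose up -d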

Thanks!

The fix has not been merged to release-next yet -- it's on the fix-xctrl-startup-race branch of openziti/ziti on GitHub.

Hi @mvelbaum. FYI, the fix was merged yesterday and should be available now on the :release-next tag.

Sorry for the late response, things have been hectic here lately.
I have deployed the version with the fix and have tried restarting the services a couple of times -- so far so good!

Will do some more tests in the coming days. Thanks!