Troubleshooting private router with error "host for token {UUID} not found"

This is the part of my network I’m having trouble with.

                                       ┌───────────────────┐
                              edge link│                   │
                                 ┌─────┤  edge endpoint 5  │
                                 │     │  #router1-policy  │
                                 │     └───────────────────┘
                                 │
                                 │
                                 │edge listen
                                 │10.11.12.253
                         ┌───────▼────────────┐
                         │                    │
                         │  private router 1  │
                         │                    │
                         └───────────┬────────┘
                                     │fabric link
                                     │
                                     │
                                     │
                                     │SNAT
─────────────────────────────────────▼─────────────────────────────────
                                     │
                                     │
                                     │
     internet                        │
                                     │
                                     │      ┌───────────────────┐
                                     │      │                   │
                            ┌────────┴──────►  public router 0  │
                            │  fabric listen│                   │
                            │        1.2.3.4└───────────────────┘
                            │
                            │
                            │
                            │
                            │
────────────────────────────▲──────────────────────────────────────────
                            │SNAT
                            │
                            │
                            │
                            │fabric link
                         ┌──┴─────────────────┐
                         │                    │
                         │  private router 2  │
                         │                    │
                         └───────────▲────────┘
                                     │edge listen
                                     │192.168.0.253
                                     │
                                     │
                                     │        ┌───────────────────┐
                                     │        │                   │
                                     ├────────┤  edge endpoint 5  │
                                     │        │  #router2-policy  │
                                     │        └───────────────────┘
                                     │
                                     │
             ┌───────────────────┐   │
             │                   │   │
             │  edge endpoint 4  ├───┘
             │  #router2-policy  │
             └───────────────────┘

As soon as I changed the router policy so that edge endpoints may only connect to the private router's edge listener, instead of the public router's, I started to see this error repeated in the logs of both the private routers and the controller.

May 28 19:30:55 nc-kencloud1-kencloud-1652209281 ziti-controller[12874]: {"context":"ch{MPyreGPG3Y}-\u003eu{classic}-\u003ei{g0j9}","error":"exceeded maximum [3] retries creating circuit [c/ftbyH3PbS]: error creating route for [s/ftbyH3PbS] on [r/yjIbNyPG6] (error creating route for [c/ftbyH3PbS]: host for token 'd716c93c-2183-4b21-9cb4-a9839a21235c' not found)","file":"github.com/openziti/edge@v0.21.127/controller/handler_edge_ctrl/common.go:78","func":"github.com/openziti/edge/controller/handler_edge_ctrl.(*baseRequestHandler).returnError","level":"error","msg":"responded with error","operation":"create.circuit","routerId":"MPyreGPG3Y","time":"2022-05-28T19:30:55.594Z","token":"a8569b16-f424-4297-883d-d7a116660aac"}
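
For anyone triaging the same error, here is a minimal Python sketch that pulls the identifiers out of a log entry like the one above so they are easier to compare with each other. The log line is copied from this post with the hard line wraps removed. The function name and the dictionary keys are my own labels, not OpenZiti terms, and the sketch assumes the journald prefix contains no "{" before the JSON payload.

```python
import json
import re

# The controller log entry quoted in this post, with the hard line wraps removed.
LOG_LINE = r'''May 28 19:30:55 nc-kencloud1-kencloud-1652209281 ziti-controller[12874]: {"context":"ch{MPyreGPG3Y}-\u003eu{classic}-\u003ei{g0j9}","error":"exceeded maximum [3] retries creating circuit [c/ftbyH3PbS]: error creating route for [s/ftbyH3PbS] on [r/yjIbNyPG6] (error creating route for [c/ftbyH3PbS]: host for token 'd716c93c-2183-4b21-9cb4-a9839a21235c' not found)","file":"github.com/openziti/edge@v0.21.127/controller/handler_edge_ctrl/common.go:78","func":"github.com/openziti/edge/controller/handler_edge_ctrl.(*baseRequestHandler).returnError","level":"error","msg":"responded with error","operation":"create.circuit","routerId":"MPyreGPG3Y","time":"2022-05-28T19:30:55.594Z","token":"a8569b16-f424-4297-883d-d7a116660aac"}'''


def first_group(pattern, text):
    """Return the first capture group of pattern in text, or None if absent."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def parse_route_error(line):
    """Extract the IDs named in a 'host for token ... not found' log entry."""
    # Skip the syslog/journald prefix and decode the JSON payload.
    entry = json.loads(line[line.index("{"):])
    error = entry["error"]
    return {
        "operation": entry["operation"],
        # Top-level JSON fields of the entry.
        "router_id_field": entry["routerId"],
        "token_field": entry["token"],
        # Identifiers embedded in the error message text.
        "circuit": first_group(r"\[c/([^\]]+)\]", error),
        "route_router": first_group(r"on \[r/([^\]]+)\]", error),
        "host_token": first_group(r"host for token '([0-9a-f-]+)' not found", error),
    }


if __name__ == "__main__":
    for key, value in parse_route_error(LOG_LINE).items():
        print(f"{key}: {value}")
```

Run against the entry above, this shows that the router named in the error text (r/yjIbNyPG6) is not the one in the routerId field (MPyreGPG3Y), and that the "host for token" UUID differs from the top-level token field.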

I suspect this is similar to a “no terminators” error, based on a quick glance at the Go context where this error occurs (link to “main” branch because the version tags in repo “edge” don’t match the semver).

$ ./ziti-router version
v0.25.2

I never figured out precisely what this UUID is that the router and controller (repeatedly?) report as not found: d716c93c-2183-4b21-9cb4-a9839a21235c. The error says “host for token … not found”.

The services still work if I allow the hosting edge endpoints to use the public router, and I verified the private edge listener is reachable on the underlay from the hosting edge endpoint.

My overall goal is to force latency-sensitive hosting edge endpoints to always use the private router at their same locations. I am aware they will typically do this automatically, but I wanted to force it to work for learning and testing because it didn’t seem to happen automatically when my router policy was less strict and allowed the endpoints to choose between any private or public router they could reach.

I forgot to mention the reason I created the topic is that I found the error when the services stopped working completely with the router policy change I mentioned.

Have you run ziti edge policy-advisor for both services and identities and do all report “OK”? Is there anything useful in the output that might be a pointer?

Can you confirm that your private endpoints are able to connect to their private edge router? If they are, I think maybe a service edge router policy is incorrect. I would think policy-advisor for services might show this?

Sorry, I also forgot to mention that policy advisor says “OKAY” for both services and identities. Yes, I had confirmed the private edge listener is reachable by the hosting edge endpoint that should have provided the terminator. Also, I confirmed the dialing/client edge endpoint is saying “no terminators”, and so I suspect the “host for token … not found” error on the private edge router is a clue.

If you ever see a “no terminator” error, that indicates to me that the identity that is supposed to host the service is somehow blocked or not allowed to host it. Can you start by making two edge router policies that allow both of your endpoints to access the ‘public’ router, and removing the blanket “all/public” policy that you may have? Once you have two policies allowing access to the public router, can you then make one side ‘private’ while the other stays public and see if we can isolate which policy might be causing a problem? Does this test make sense to you? I’m trying to figure out whether the issue is on the ‘client’ side or the ‘host’ side. My assertion is that it’s going to be on the ‘bind/host’ side. If you can keep your ‘client’ side private like you’re doing, we can focus on the ‘host/bind’ side.
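
The isolation test above can be reasoned about with a toy model before touching the controller. The Python sketch below is a simplified stand-in for how edge router policies grant identities access to routers through role attributes. It is not OpenZiti's implementation: it only handles plain attributes, treats a match on any listed attribute as sufficient, and ignores service edge router policies. The router names follow the diagram; the policy names and the attributes "dialer", "host", "public" and "router2" are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Toy edge router policy: identities with any identity role may use
    routers with any router role (a simplification, see the lead-in)."""
    name: str
    identity_roles: frozenset
    router_roles: frozenset


# Router names from the diagram; the attributes are hypothetical.
ROUTERS = {
    "public router 0": {"public"},
    "private router 1": {"router1"},
    "private router 2": {"router2"},
}


def routers_for(identity_attrs, policies, routers=ROUTERS):
    """Return the sorted names of routers that some policy grants to an identity."""
    allowed = set()
    for policy in policies:
        if policy.identity_roles & set(identity_attrs):
            for name, attrs in routers.items():
                if policy.router_roles & set(attrs):
                    allowed.add(name)
    return sorted(allowed)


# Step 1: no blanket policy; one policy per side, both pointing at the public router.
STEP_1 = [
    Policy("dialer-public", frozenset({"dialer"}), frozenset({"public"})),
    Policy("host-public", frozenset({"host"}), frozenset({"public"})),
]

# Step 2: only the hosting side is moved to its local private router.
STEP_2 = [
    Policy("dialer-public", frozenset({"dialer"}), frozenset({"public"})),
    Policy("host-private", frozenset({"host"}), frozenset({"router2"})),
]

if __name__ == "__main__":
    for label, policies in (("step 1", STEP_1), ("step 2", STEP_2)):
        print(label, "dialer:", routers_for({"dialer"}, policies))
        print(label, "host:  ", routers_for({"host"}, policies))
```

Changing one side at a time keeps the other side as a known-good baseline, so a failure that appears in step 2 points at the policy that was just changed. An empty list for either identity would mean no policy in the model grants it any router at all.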

In the process of creating the test setup you suggested, it suddenly started working. I suspect that either there was a long delay before a policy change became effective, or I overlooked something the first time that I got right the second time. I can’t seem to reproduce the failure.
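
One way to tell a slow policy change apart from a misconfiguration is to poll a check after each change and record how long it takes to pass. The Python sketch below is a generic helper under that assumption, not anything OpenZiti provides. The check callable is left to the reader, and the timeout and interval defaults are arbitrary.

```python
import time


def wait_until(check, timeout=60.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() every `interval` seconds until it returns True.

    Returns the elapsed seconds when the check first passes, or None if
    another attempt would exceed `timeout`. `sleep` and `clock` are
    injectable so the helper can be exercised without real waiting.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start + interval > timeout:
            return None
        sleep(interval)
```

A caller would pass its own check, for example a function that returns True once the dialing endpoint stops reporting “no terminators”. A None result after a generous timeout suggests the problem is the policy itself rather than a delay.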

I have some combinations of edge endpoints and services that I wish to require a local private edge router and this had the desired effect of reducing latency noticeably. Huzzah!

Thanks for sharing your journey with this one. I found it very interesting.

Thanks for sharing this @TheLumberjack … this is a great way to diagnose issues.