Troubleshooting OpenZiti Tunnel: Edge Router and Service Configuration Issues

Hi @TheLumberjack. For now I have decided to ignore router-services and have removed it.

These are the current routers:

$ ziti edge list edge-routers
╭───────────┬────────────────┬────────┬───────────────┬──────┬────────────────╮
│ ID        │ NAME           │ ONLINE │ ALLOW TRANSIT │ COST │ ATTRIBUTES     │
├───────────┼────────────────┼────────┼───────────────┼──────┼────────────────┤
│ SolPOIazd │ router-private │ true   │ true          │    0 │ router-private │
│ zFVPOI2ib │ router-public  │ true   │ true          │    0 │ router-public  │
╰───────────┴────────────────┴────────┴───────────────┴──────┴────────────────╯
results: 1-2 of 2

I have set these policies:

ziti edge create edge-router-policy router-private-router-policy \
--edge-router-roles "#router-private" \
--identity-roles "#EC2-private" \
--semantic "AllOf"

ziti edge create edge-router-policy router-public-router-policy \
--edge-router-roles "#router-public" \
--identity-roles "#EC2-public" \
--semantic "AllOf"

ziti edge create service-edge-router-policy all-services-on-all-routers \
      --edge-router-roles '#all' \
      --service-roles '#all'
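
To make the effect of these policies concrete, here is a minimal sketch of how `#` role matching pairs identities with edge routers. It is a simplified model, not OpenZiti's implementation: it handles only `#attribute` and `#all` roles, and the identity names and attributes are assumptions, since the identities themselves are never listed in this thread.

```python
# Role attributes of the routers, copied from the ATTRIBUTES column above.
ROUTERS = {
    "router-private": {"router-private"},
    "router-public": {"router-public"},
}

# ASSUMPTION: hypothetical identities whose attributes match the policy roles.
IDENTITIES = {
    "EC2-private": {"EC2-private"},
    "EC2-public": {"EC2-public"},
}

# (edge-router roles, identity roles, semantic) of the two edge-router-policies.
EDGE_ROUTER_POLICIES = [
    (["#router-private"], ["#EC2-private"], "AllOf"),
    (["#router-public"], ["#EC2-public"], "AllOf"),
]


def selects(roles, semantic, attributes):
    """Return True if an entity with these attributes is selected by the roles."""
    hits = [r == "#all" or (r.startswith("#") and r[1:] in attributes) for r in roles]
    if not hits:
        return False
    # AllOf requires every listed role to match; AnyOf requires at least one.
    return all(hits) if semantic == "AllOf" else any(hits)


def routers_for(identity):
    """Return the names of the edge routers the identity may connect to."""
    attrs = IDENTITIES[identity]
    usable = set()
    for router_roles, identity_roles, semantic in EDGE_ROUTER_POLICIES:
        if selects(identity_roles, semantic, attrs):
            usable |= {
                name
                for name, router_attrs in ROUTERS.items()
                if selects(router_roles, semantic, router_attrs)
            }
    return usable


for name in IDENTITIES:
    print(name, "->", sorted(routers_for(name)))
```

Under these assumptions each identity can use only its own router, so any traffic between the two sides has to cross a router-to-router link.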

No errors are seen in the screen session of EC2-public, but in the screen session of EC2-private I get this error:

(17515)[        9.823]   ERROR ziti-sdk:connect.c:1071 connect_reply_cb() conn[0.0/Wy7DGYVj/Connecting](apache-service) failed to connect, reason=can't route from SolPOIazd -> zFVPOI2ib
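
The two IDs in this message are edge router IDs, so it helps to translate them into router names. The sketch below does that with a regular expression; the ID-to-name mapping is copied from the `ziti edge list edge-routers` output above, and the function name is made up for illustration.

```python
import re

# Matches the "can't route from <router id> -> <router id>" part of the error.
ROUTE_ERROR = re.compile(r"can't route from (\S+) -> (\S+)")

# Router IDs and names from the `ziti edge list edge-routers` table above.
ROUTER_NAMES = {
    "SolPOIazd": "router-private",
    "zFVPOI2ib": "router-public",
}


def explain_route_error(line):
    """Return (source router, destination router) names, or None if no match."""
    match = ROUTE_ERROR.search(line)
    if match is None:
        return None
    src, dst = match.groups()
    # Fall back to the raw ID when the router is not in the mapping.
    return ROUTER_NAMES.get(src, src), ROUTER_NAMES.get(dst, dst)


line = (
    "ERROR ziti-sdk:connect.c:1071 connect_reply_cb() "
    "conn[0.0/Wy7DGYVj/Connecting](apache-service) failed to connect, "
    "reason=can't route from SolPOIazd -> zFVPOI2ib"
)
print(explain_route_error(line))
```

Read this way, the error says that no path exists from router-private to router-public.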

Please list your links. I suspect your routers are not linking together. Check the advertised addresses, check that the routers are configured with link listeners, and verify that the link listener address is reachable from the router that should dial it.

ziti fabric list links
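
The last check suggested above, whether a link listener address is reachable from the dialing side, can be approximated with a plain TCP connect. This is only a rough sketch: it proves that the name resolves and the port accepts connections, not that the TLS handshake between routers will succeed, and it assumes Python is available somewhere in the same network position as the dialing router.

```python
import socket


def can_dial(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `can_dial("ziti-router-transport-private.ziti-router.svc.cluster.local", 443)` tests the link listener address that router-private advertises in the values file posted in this thread. A result of `False` points at a wrong hostname, a missing Service, or a blocked port.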

Oh, this is empty:

$ ziti fabric list links
╭────┬────────┬──────────┬─────────────┬─────────────┬─────────────┬───────┬────────┬───────────╮
│ ID │ DIALER │ ACCEPTOR │ STATIC COST │ SRC LATENCY │ DST LATENCY │ STATE │ STATUS │ FULL COST │
├────┼────────┼──────────┼─────────────┼─────────────┼─────────────┼───────┼────────┼───────────┤
╰────┴────────┴──────────┴─────────────┴─────────────┴─────────────┴───────┴────────┴───────────╯
results: none
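
If this check needs to be repeated while changing the configuration, the table can be parsed instead of read by eye. The sketch below parses the box-drawn text shown above and reports whether two routers share a connected link; the function names are made up for illustration, and it relies only on the column layout visible in this output.

```python
def parse_ziti_table(text):
    """Parse a box-drawn ziti CLI table into a list of row dictionaries."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Header and data rows start with "│"; border lines do not.
        if not line.startswith("│"):
            continue
        rows.append([cell.strip() for cell in line.strip("│").split("│")])
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


def routers_linked(links, a, b):
    """Return True if a connected link exists between a and b, in either direction."""
    for link in links:
        same_pair = {link["DIALER"], link["ACCEPTOR"]} == {a, b}
        if same_pair and link["STATE"] == "Connected":
            return True
    return False


EMPTY_OUTPUT = """
╭────┬────────┬──────────┬─────────────┬─────────────┬─────────────┬───────┬────────┬───────────╮
│ ID │ DIALER │ ACCEPTOR │ STATIC COST │ SRC LATENCY │ DST LATENCY │ STATE │ STATUS │ FULL COST │
├────┼────────┼──────────┼─────────────┼─────────────┼─────────────┼───────┼────────┼───────────┤
╰────┴────────┴──────────┴─────────────┴─────────────┴─────────────┴───────┴────────┴───────────╯
results: none
"""

links = parse_ziti_table(EMPTY_OUTPUT)
print(routers_linked(links, "router-private", "router-public"))  # prints False
```

With no link between the two routers, the fabric has no path for the circuit, which matches the "can't route from" error.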

Is there a command to set up the links, or should that happen automatically? BTW, below are the values.yml files used for router-private and router-public, for your reference.

router-private.yml

ctrl:
  endpoint: ziti-controller.example.com:443
  advertisedHost: ziti-router-private.example.com

# Edge configuration for external identities
edge:
  advertisedHost: ziti-router-private.example.com
  advertisedPort: 443
  service:
    type: LoadBalancer  
    annotations:
      external-dns.alpha.kubernetes.io/hostname: ziti-router-private.example.com
      service.beta.kubernetes.io/aws-load-balancer-internal: "true"
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
      service.beta.kubernetes.io/aws-load-balancer-security-groups: "sg-0e9e6f9fce67feba1"
      service.beta.kubernetes.io/aws-load-balancer-manage-backend-security-group-rules: "true"
  ingress:
    enabled: false

# Link listeners for router-to-router communication (internal)
linkListeners:
  transport:
    advertisedHost: ziti-router-transport-private.ziti-router.svc.cluster.local
    advertisedPort: 443
    service:
      enabled: true
      type: ClusterIP  # All routers are internal; no external exposure
    ingress:
      enabled: false  # Not needed as routers are internal

# Persistence for router data
persistence:
  enabled: true
  accessMode: ReadWriteOnce
  size: 1Gi
  storageClass: ebs-sc

router-public.yml

ctrl:
  endpoint: ziti-controller.example.com:443
  advertisedHost: ziti-router-public.example.com

# Edge configuration for external identities
edge:
  advertisedHost: ziti-router-public.example.com
  advertisedPort: 443
  service:
    type: LoadBalancer  
    annotations:
      external-dns.alpha.kubernetes.io/hostname: ziti-router-public.example.com
      service.beta.kubernetes.io/aws-load-balancer-internal: "false"
      service.beta.kubernetes.io/aws-load-balancer-security-groups: "sg-0e9e6f9fce67feba1"
      service.beta.kubernetes.io/aws-load-balancer-manage-backend-security-group-rules: "true"
  ingress:
    enabled: false


# Link listeners for router-to-router communication (internal)
linkListeners:
  transport:
    advertisedHost: ziti-router-transport-public.ziti-router.svc.cluster.local
    advertisedPort: 443
    service:
      enabled: true
      type: ClusterIP  # All routers are internal; no external exposure
    ingress:
      enabled: false  # Not needed as routers are internal

# Persistence for router data
persistence:
  enabled: true
  accessMode: ReadWriteOnce
  size: 1Gi
  storageClass: ebs-sc

@TheLumberjack I rechecked the values.yml of both routers after posting my last message and found that I had entered the wrong advertisedHost for linkListeners. So I changed them to:

ziti-router-transport-public.ziti-router.svc.cluster.local ==> ziti-router-public-release-transport.ziti-router.svc.cluster.local
ziti-router-transport-private.ziti-router.svc.cluster.local ==> ziti-router-private-release-transport.ziti-router.svc.cluster.local
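
A mistake like this can be caught by comparing the advertised host against the DNS names of the Services that actually exist. In-cluster Service names follow the pattern `<service>.<namespace>.svc.<cluster domain>`, with `cluster.local` as the usual default domain. The sketch below assumes the two Service names implied by the corrected hostnames; in practice they would be read from the cluster, for example with `kubectl get svc -n ziti-router`.

```python
def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Build the in-cluster DNS name of a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def check_advertised_host(advertised_host, service_names, namespace):
    """Return True if advertised_host is the DNS name of one of the Services."""
    valid = {service_fqdn(name, namespace) for name in service_names}
    return advertised_host in valid


# ASSUMPTION: Service names inferred from the corrected hostnames above.
SERVICES = [
    "ziti-router-public-release-transport",
    "ziti-router-private-release-transport",
]
NAMESPACE = "ziti-router"

old_host = "ziti-router-transport-public.ziti-router.svc.cluster.local"
new_host = "ziti-router-public-release-transport.ziti-router.svc.cluster.local"

print(check_advertised_host(old_host, SERVICES, NAMESPACE))
print(check_advertised_host(new_host, SERVICES, NAMESPACE))
```

Under that assumption the old hostname matches no Service, so the other router has nothing to dial, while the corrected one does.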

Later I reinstalled both routers with the updated values.yml and then got the output below for this command:

$ ziti fabric list links
╭────────────────────────┬───────────────┬────────────────┬─────────────┬─────────────┬─────────────┬───────────┬────────┬───────────╮
│ ID                     │ DIALER        │ ACCEPTOR       │ STATIC COST │ SRC LATENCY │ DST LATENCY │ STATE     │ STATUS │ FULL COST │
├────────────────────────┼───────────────┼────────────────┼─────────────┼─────────────┼─────────────┼───────────┼────────┼───────────┤
│ 6sRZrjtjhtkaQF8VtkIgdl │ router-public │ router-private │           1 │       3.7ms │       3.8ms │ Connected │     up │         7 │
╰────────────────────────┴───────────────┴────────────────┴─────────────┴─────────────┴─────────────┴───────────┴────────┴───────────╯

BTW, I also checked the pod logs of both routers.

The logs were very large, so I have uploaded them:
public-router-pod.txt (720.7 KB)
private-router-pod.txt (702.6 KB)

Is there any way to fix those issues?

When you see handshake failed messages, this generally indicates the PKI for some part of the overlay is incorrect. What are these IPs?

  • 10.0.1.248
  • 10.0.2.252
  • 10.0.2.33
  • 10.0.3.162
  • 10.0.3.85

You will see logs like this when a connection is attempted but the certificate presented didn't match. My guess is these are old routers trying to connect to the new routers. It could also be ziti-edge-tunnel instances which are no longer valid. The errors shouldn't be preventing your testing.

At this point, I'd expect you to no longer get the "can't route from" error. Are things working now, other than the errors?

Thanks for your response, @TheLumberjack.

When I changed this:

ziti-router-transport-public.ziti-router.svc.cluster.local ==> ziti-router-public-release-transport.ziti-router.svc.cluster.local
ziti-router-transport-private.ziti-router.svc.cluster.local ==> ziti-router-private-release-transport.ziti-router.svc.cluster.local

I got the following output when running the ziti fabric list links command:

$ ziti fabric list links
╭────────────────────────┬───────────────┬────────────────┬─────────────┬─────────────┬─────────────┬───────────┬────────┬───────────╮
│ ID                     │ DIALER        │ ACCEPTOR       │ STATIC COST │ SRC LATENCY │ DST LATENCY │ STATE     │ STATUS │ FULL COST │
├────────────────────────┼───────────────┼────────────────┼─────────────┼─────────────┼─────────────┼───────────┼────────┼───────────┤
│ 6sRZrjtjhtkaQF8VtkIgdl │ router-public │ router-private │           1 │       3.7ms │       3.8ms │ Connected │     up │         7 │
╰────────────────────────┴───────────────┴────────────────┴─────────────┴─────────────┴─────────────┴───────────┴────────┴───────────╯

After this change, the service became reachable, and the curl command worked.

Regarding the logs, I plan to create a fresh EKS cluster and set everything up from scratch. This will help me determine whether the error logs about the connection attempts are due to old identities, routers, or some other issue. If I can't resolve it, I'll create a new post for that topic, as it seems to be a separate issue.

There we go, excellent! We did it! Thanks for letting me know.

@TheLumberjack @scareything
Thanks to both of you for spending your valuable time helping me with this case.
