Weird connection problems

Hello everyone,
The OpenZiti network itself runs very well. However, terminators for services are regularly not created or renewed. If I manually restart the controller, router, and tunneler containers, all terminators are created and the connections work again.

I am not sure what exactly the problem is or how I should approach the debugging. I always find the following error message in the router log when terminators are missing and a connection to the services is no longer possible:

[1673.602] WARNING ziti/router/xgress_edge.(*edgeClientConn).processConnect [ch{edge}->u{classic}->i{2eXq}]: {connId=[0] error=[service h01F3txM8ahZufijK5KLg has no terminators] token=[45cb36f6-398f-4582-8768-484d21756156] type=[EdgeConnectType] chSeq=[3] edgeSeq=[0]} failed to dial fabric
[1673.619] WARNING ziti/router/xgress_edge.(*edgeClientConn).processConnect [ch{edge}->u{classic}->i{2eXq}]: {error=[service h01F3txM8ahZufijK5KLg has no terminators] connId=[1] type=[EdgeConnectType] chSeq=[4] token=[45cb36f6-398f-4582-8768-484d21756156] edgeSeq=[0]} failed to dial fabric
[1674.136] WARNING ziti/router/xgress_edge.(*edgeClientConn).processConnect [ch{edge}->u{classic}->i{2eXq}]: {edgeSeq=[0] error=[service h01F3txM8ahZufijK5KLg has no terminators] token=[45cb36f6-398f-4582-8768-484d21756156] connId=[2] type=[EdgeConnectType] chSeq=[5]} failed to dial fabric
[1674.196] WARNING ziti/router/xgress_edge.(*edgeClientConn).processConnect [ch{edge}->u{classic}->i{2eXq}]: {connId=[3] type=[EdgeConnectType] chSeq=[6] edgeSeq=[0] token=[45cb36f6-398f-4582-8768-484d21756156] error=[service h01F3txM8ahZufijK5KLg has no terminators]} failed to dial fabric
[1674.748] WARNING ziti/router/xgress_edge.(*edgeClientConn).processConnect [ch{edge}->u{classic}->i{2eXq}]: {edgeSeq=[0] error=[service h01F3txM8ahZufijK5KLg has no terminators] connId=[4] token=[45cb36f6-398f-4582-8768-484d21756156] type=[EdgeConnectType] chSeq=[7]} failed to dial fabric
[1674.802] WARNING ziti/router/xgress_edge.(*edgeClientConn).processConnect [ch{edge}->u{classic}->i{2eXq}]: {chSeq=[8] edgeSeq=[0] token=[45cb36f6-398f-4582-8768-484d21756156] error=[service h01F3txM8ahZufijK5KLg has no terminators] connId=[5] type=[EdgeConnectType]} failed to dial fabric
[1675.354] WARNING ziti/router/xgress_edge.(*edgeClientConn).processConnect [ch{edge}->u{classic}->i{2eXq}]: {token=[45cb36f6-398f-4582-8768-484d21756156] connId=[6] type=[EdgeConnectType] chSeq=[9] edgeSeq=[0] error=[service h01F3txM8ahZufijK5KLg has no terminators]} failed to dial fabric
[1675.399] WARNING ziti/router/xgress_edge.(*edgeClientConn).processConnect [ch{edge}->u{classic}->i{2eXq}]: {chSeq=[10] edgeSeq=[0] error=[service h01F3txM8ahZufijK5KLg has no terminators] token=[45cb36f6-398f-4582-8768-484d21756156] connId=[7] type=[EdgeConnectType]} failed to dial fabric
[1675.635] WARNING ziti/router/xgress_edge.(*edgeClientConn).processConnect [ch{edge}->u{classic}->i{2eXq}]: {edgeSeq=[0] connId=[8] type=[EdgeConnectType] chSeq=[11] error=[service h01F3txM8ahZufijK5KLg has no terminators] token=[45cb36f6-398f-4582-8768-484d21756156]} failed to dial fabric
[1675.921] WARNING ziti/router/xgress_edge.(*edgeClientConn).processConnect [ch{edge}->u{classic}->i{2eXq}]: {edgeSeq=[0] connId=[9] type=[EdgeConnectType] token=[45cb36f6-398f-4582-8768-484d21756156] chSeq=[12] error=[service h01F3txM8ahZufijK5KLg has no terminators]} failed to dial fabric
[1675.934] WARNING ziti/router/xgress_edge.(*edgeClientConn).processConnect [ch{edge}->u{classic}->i{2eXq}]: {chSeq=[13] edgeSeq=[0] connId=[10] token=[45cb36f6-398f-4582-8768-484d21756156] error=[service h01F3txM8ahZufijK5KLg has no terminators] type=[EdgeConnectType]} failed to dial fabric
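
To see which services are affected and how often, the warnings above can be tallied per service ID. This is a minimal sketch that only assumes the `error=[service <id> has no terminators]` field format visible in the log lines above; the function name and the abbreviated sample lines are illustrative, not part of any OpenZiti tooling.

```python
import re
from collections import Counter

# Matches the error field of the router warnings shown above.
NO_TERMINATORS = re.compile(r"error=\[service (\S+) has no terminators\]")

def count_missing_terminators(lines):
    """Return a Counter mapping service ID -> number of failed dials."""
    counts = Counter()
    for line in lines:
        match = NO_TERMINATORS.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Abbreviated sample lines (the "..." stands for fields omitted here).
sample = [
    "[1673.602] WARNING ... {connId=[0] error=[service h01F3txM8ahZufijK5KLg has no terminators] ...} failed to dial fabric",
    "[1673.619] WARNING ... {error=[service h01F3txM8ahZufijK5KLg has no terminators] connId=[1] ...} failed to dial fabric",
    "[1673.700] INFO ... unrelated line",
]

for service_id, failures in count_missing_terminators(sample).most_common():
    print(f"{service_id}\t{failures}")
```

Passing an open log file instead of `sample` works the same way, since the function only iterates over lines.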

The controller config looks like this. I have put the management API on its own port:

v: 3
db:                     "/ziti-controller/bbolt.db"
identity:
  cert:        "pki/intermediate/certs/client.chain.pem"
  server_cert: "pki/intermediate/certs/server.chain.pem"
  key:         "pki/intermediate/keys/server.key"
  ca:          "pki/root/certs/root.cert"
ctrl:
  options:
    advertiseAddress: tls:openziti.my.domain:443
  listener:             tls:0.0.0.0:1280
healthChecks:
  boltCheck:
    interval: 30s
    timeout: 20s
    initialDelay: 30s
edge:
  api:
    sessionTimeout: 30m
    address: openziti.my.domain:443
  enrollment:
    signingCert:
      cert: pki/intermediate/certs/intermediate.cert
      key:  pki/intermediate/keys/intermediate.key
    edgeIdentity:
      duration: 180m
    edgeRouter:
      duration: 180m
web:
  - name: public
    bindPoints:
      - interface: 0.0.0.0:1280
        address: openziti.my.domain:443
    identity:
      ca:          "pki/root/certs/root.cert"
      key:         "pki/intermediate/keys/server.key"
      server_cert: "pki/intermediate/certs/server.chain.pem"
      cert:        "pki/intermediate/certs/client.chain.pem"
    options:
      idleTimeout: 5000ms
      readTimeout: 5000ms
      writeTimeout: 100000ms
      minTLSVersion: TLS1.2
      maxTLSVersion: TLS1.3
    apis:
      - binding: edge-client
        options: { }
  - name: private
    bindPoints:
      - interface: 0.0.0.0:8080
        address: mgmt.openziti.my.domain:443
    options:
      idleTimeout: 5000ms
      readTimeout: 5000ms
      writeTimeout: 100000ms
      minTLSVersion: TLS1.2
      maxTLSVersion: TLS1.3
    apis:
      - binding: edge-client
        options: { }
      - binding: edge-management
        options: { }
      - binding: fabric
        options: { }
      - binding: zac
        options:
          location: /ziti-console
          indexFile: index.html

And the router config looks like this:

v: 3
identity:
  cert:             "router.cert"
  server_cert:      "/ziti-router/router.server.chain.cert"
  key:              "/ziti-router/router.key"
  ca:               "/ziti-router/router.cas"
ctrl:
  endpoint:             tls:openziti.my.domain:443
link:
  dialers:
    - binding: transport
  listeners:
    - binding:          transport
      bind:             tls:0.0.0.0:3022
      advertise:        tls:router.openziti.my.domain:443
      options:
        outQueueSize:   4
listeners:
  - binding: edge
    address: tls:0.0.0.0:3022
    options:
      advertise: router.openziti.my.domain:443
      connectTimeoutMs: 5000
      getSessionTimeout: 60
  - binding: tunnel
    options:
      mode: host
edge:
  csr:
    country: US
    province: NC
    locality: Charlotte
    organization: NetFoundry
    organizationalUnit: Ziti
    sans:
      dns:
        - localhost
        - router.openziti.my.domain
        - openziti-router
      ip:
        - "127.0.0.1"
        - "::1"
forwarder:
  latencyProbeInterval: 0
  xgressDialQueueLength: 1000
  xgressDialWorkerCount: 128
  linkDialQueueLength: 1000
  linkDialWorkerCount: 32

I would be very grateful for any tips on what the problem is.

EDIT: All tunnelers are now connected to the controller and router, but no terminators are created, and I don't know why.

First of all, what versions of the software (controller/router/tunneler) are you using?

Hello, I'm using 1.1.15 for the controller and router, and version 1.2.2 for the tunneler.

In the meantime, I have found the cause myself.
I had attached an OS posture check to the bind policy. If I remove this check, the terminators are created.

The tunneler runs on a Linux server with kernel 6.11.3-2-default, and the posture check looks like this:

{
  “name": ‘check-os-version’,
  “typeId": ‘OS’,
  “roleAttributes":[
  ],
  “tags":{
  },
  “operatingSystems":[
    {
      “type": ‘Android’,
      “versions":[
        “>12.0.0"]
    },
    {
      “type": ‘Linux’,
      “versions":[
        “>6.0.0"]
    },
    {
      “type": ‘Windows’,
      “versions":[
        “>10.0.0"]
    },
    {
      “type": ‘macOS’,
      “versions":[
        “>23.0.0"]
    }]
}

So the Linux kernel check does not seem to work.
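
As a sanity check on the condition itself, the comparison can be evaluated locally. The sketch below is not how the controller or the tunneler evaluates posture checks (that logic is not shown in this thread); it assumes a plain numeric reading of the leading `major.minor.patch` and simply ignores the `-2-default` suffix. The helper names are hypothetical.

```python
import operator
import re

# Comparison operators as they appear in the "versions" conditions.
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt, "=": operator.eq}

def numeric_version(text):
    """Extract the leading major.minor.patch as ints; missing parts become 0."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"no leading version number in {text!r}")
    return tuple(int(part or 0) for part in match.groups())

def satisfies(version, condition):
    """Evaluate a condition such as '>6.0.0' against a version string."""
    match = re.match(r"\s*(>=|<=|>|<|=)\s*(.+)", condition)
    if match is None:
        raise ValueError(f"no comparison operator in {condition!r}")
    op, wanted = match.groups()
    return OPS[op](numeric_version(version), numeric_version(wanted))

print(satisfies("6.11.3-2-default", ">6.0.0"))
print(satisfies("6.11.3-2-default", ">=6.0.0"))
```

Under this numeric reading the kernel release satisfies the Linux condition, so the condition as written is not obviously wrong. Whether the real implementation treats the release suffix differently is an open question here.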

Does anyone have any idea what could be wrong with the Posture Check Policy?

As soon as I attach this to a bind policy, I get the following error message on the affected tunneler (here: openziti/ziti-host:1.2.2):

[openziti-tunneler] | (2)[     5263.302]    WARN ziti-sdk:bind.c:246 session_cb() server[0.1](my-service.ziti) failed to get session for service[my-service.ziti]: -25/INVALID_POSTURE

I couldn't replicate the issue with a Linux OS version posture check, but I did encounter a console bug that causes the versions to lack comparison operators. Your JSON representation of the posture check doesn't have this problem, but it is something to be aware of if you're using the console to manage these.

Follow up thought: do you get the same result with >=? If not, it could be a bug that manifests when using >.

Yes, I also tried that with >=, but it doesn't work either.

Unfortunately, the log file does not provide any information that I could post here to narrow down the problem. The only thing I can say for sure is that if I remove the posture check, everything works without any problems.

You could try the posture check without other OS version conditions, only Linux, in case the problem manifests when a certain OS's version condition is present, e.g., macOS. My test that didn't encounter this problem used a single posture check with Linux and Windows version conditions and lowercase OS names.

The OS names in your example match this document: Posture Checks | OpenZiti, but the document's examples are uppercase, leading me to assume the OS names are case insensitive.

I have now tried the following options:

  • Capital letters only (WINDOWS, LINUX, ...)
  • “>” and “>=”
  • A separate Posture Check for each OS
  • Create posture checks via ZAC

Unfortunately, none of this worked, and none of the log files indicate why the client (here an Android 15 device, as an example) is not allowed to establish a connection. Very frustrating.

I'm sorry this is proving to be tough for us to track down. This may be a shot in the dark, but I noticed that the quotes in your posture check policy look like non-ASCII characters, e.g. “ vs ". I'd expect this to result in parse errors, so maybe the quotes were converted when you copied/pasted into this thread?
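
A quick way to rule this out on the real policy file is to scan it for typographic quotes and try to parse it. This is a minimal sketch using only the Python standard library; the function name and the short `pasted` sample (modeled on the first field of the policy as it appears in this thread) are illustrative.

```python
import json

# Typographic quotes that some editors substitute for " and '.
SMART_QUOTES = "\u201c\u201d\u2018\u2019"

def find_smart_quotes(text):
    """Return (line, column, character) for every typographic quote in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SMART_QUOTES:
                hits.append((line_no, col, ch))
    return hits

# Sample resembling the pasted policy: curly opening quotes, one straight quote.
pasted = '{\n  \u201cname": \u2018check-os-version\u2019\n}'
print(find_smart_quotes(pasted))

try:
    json.loads(pasted)
except json.JSONDecodeError as err:
    print("not valid JSON:", err)
```

If the scan of the actual file returns an empty list and `json.loads` succeeds, the quotes were only altered when pasting into the thread.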

Thanks for the reply and I totally understand that it is difficult to recognize an error without more detailed information.
About the tip with the quotation marks: that was just a copy-paste error; in the actual config the quotes are correct.

Is there any way to turn up the logging/debugging to be able to share more information here?

Alternatively, the only option I can think of is that you create a posture check with curl that works for you, and I then test it here.

ziti-edge-tunnel set_log_level --loglevel DEBUG