Helm ziti-edge-tunnel DNS not always resolving

Hello!

We are trying a PoC that uses OpenZiti to allow two pods in different k8s clusters to communicate. Each k8s cluster has only one node. For this we are using the Helm charts, specifically ziti-edge-tunnel.

While we are able to establish a connection between the two pods, we see some strange behavior.
Domain names of the Ziti service are not always resolved. For example, from inside a pod, running curl http://mariadb.service:3306 resolves about half of the time, while the other half it returns that it cannot resolve the hostname. Note that this problem does not occur when running the curl command from the host itself (i.e., not in a pod).
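For reference, the failure rate can be sampled from inside a pod with a loop like this (a rough sketch; getent exercises the same NSS/DNS resolution path curl uses, and mariadb.service is the intercept address from this PoC):

```shell
# Count how often the Ziti service name fails to resolve from inside a pod.
# "mariadb.service" is the intercept address from this PoC; adjust as needed.
fails=0
for i in $(seq 1 20); do
  getent hosts mariadb.service >/dev/null 2>&1 || fails=$((fails + 1))
done
echo "failed ${fails}/20 lookups"
```

Running this a few times makes the intermittent behavior easy to quantify and compare between the pod and the host.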

CoreDNS was configured as in the instructions, but with your.ziti.domain changed to mariadb.service. We tried both 100.64.0.2 (as in the instructions) and 100.64.0.1 (the Ziti-created interface on the host). Both show the same behavior.

Any thoughts?

edit: The logs of the ziti-edge-tunnel also show various errors.

On the bind side, it seems to lose its connection to the controller and restart:

(8)[2023-12-12T14:15:43.605Z]   ERROR tlsuv:tcp_src.c:113 connect failed: -3008(unknown node or service)
(8)[2023-12-12T14:15:43.605Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-12T14:15:43.605Z]   ERROR ziti-sdk:ziti.c:1307 edge_routers_cb() ztx[0] failed to get current edge routers: CONTROLLER_UNAVAILABLE/unknown node or service
(8)[2023-12-12T14:15:43.605Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-12T14:15:43.605Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-12T14:15:43.605Z]   ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]
(8)[2023-12-12T14:15:43.605Z]    WARN tunnel-cbs:ziti_tunnel_ctrl.c:781 on_ziti_event() ziti_ctx controller connections failed: ziti controller is not available
(8)[2023-12-12T14:15:43.605Z]    INFO ziti-edge-tunnel:ziti-edge-tunnel.c:1147 on_event() ztx[/ziti-edge-tunnel/ziti-run-node-ziti-edge-tunnel-identity.json] context event : status is ziti controller is not available
(8)[2023-12-12T14:15:43.605Z]   ERROR ziti-edge-tunnel:ziti-edge-tunnel.c:1202 on_event() ztx[/ziti-edge-tunnel/ziti-run-node-ziti-edge-tunnel-identity.json] failed to connect to controller due to ziti controller is not available
(8)[2023-12-12T14:15:53.788Z]    INFO tunnel-cbs:ziti_tunnel_ctrl.c:767 on_ziti_event() ziti_ctx[ocm-1] connected to controller
(8)[2023-12-12T14:15:53.788Z]    INFO ziti-edge-tunnel:ziti-edge-tunnel.c:1147 on_event() ztx[/ziti-edge-tunnel/ziti-run-node-ziti-edge-tunnel-identity.json] context event : status is OK

On the dial side, this error appears periodically for each service defined in Ziti, regardless of policy:

ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-12T14:12:37.175Z]    WARN ziti-sdk:bind.c:213 session_cb() server[0.230] failed to get session for service[mariadb_service]: -16/CONTROLLER_UNAVAILABLE

Hello again,

Maybe you should disregard the logs above, because we cannot reproduce them.

We did a completely clean installation twice and the behavior is the same as before (nslookup fails some of the time). However, the logs from the bind and dial sides are now the following:

Bind: no errors

Dial: We get more or less the following, in a repeating fashion:

(8)[2023-12-13T12:25:23.840Z]   ERROR tlsuv:tcp_src.c:113 connect failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:23.840Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:23.840Z]   ERROR ziti-sdk:ziti.c:1307 edge_routers_cb() ztx[0] failed to get current edge routers: CONTROLLER_UNAVAILABLE/unknown node or service
(8)[2023-12-13T12:25:23.840Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:23.840Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:23.840Z]   ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]
(8)[2023-12-13T12:25:23.840Z]    WARN tunnel-cbs:ziti_tunnel_ctrl.c:781 on_ziti_event() ziti_ctx controller connections failed: ziti controller is not available
(8)[2023-12-13T12:25:23.840Z]    INFO ziti-edge-tunnel:ziti-edge-tunnel.c:1147 on_event() ztx[/ziti-edge-tunnel/ziti-run-node-ziti-edge-tunnel-identity.json] context event : status is ziti controller is not available
(8)[2023-12-13T12:25:23.840Z]   ERROR ziti-edge-tunnel:ziti-edge-tunnel.c:1202 on_event() ztx[/ziti-edge-tunnel/ziti-run-node-ziti-edge-tunnel-identity.json] failed to connect to controller due to ziti controller is not available
(8)[2023-12-13T12:25:33.842Z]   ERROR tlsuv:tcp_src.c:113 connect failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:33.842Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:33.842Z]   ERROR ziti-sdk:ziti.c:1307 edge_routers_cb() ztx[0] failed to get current edge routers: CONTROLLER_UNAVAILABLE/unknown node or service
(8)[2023-12-13T12:25:33.842Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:33.842Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:33.842Z]   ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]
(8)[2023-12-13T12:25:39.106Z]    INFO tunnel-cbs:ziti_dns.c:500 format_resp() found record[100.64.0.3] for query[1:mariadb.service]
(8)[2023-12-13T12:25:43.844Z]   ERROR tlsuv:tcp_src.c:113 connect failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:43.844Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:43.844Z]   ERROR ziti-sdk:ziti.c:1307 edge_routers_cb() ztx[0] failed to get current edge routers: CONTROLLER_UNAVAILABLE/unknown node or service
(8)[2023-12-13T12:25:43.844Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:43.844Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:43.844Z]   ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]
(8)[2023-12-13T12:25:53.845Z]   ERROR tlsuv:tcp_src.c:113 connect failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:53.845Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:53.845Z]   ERROR ziti-sdk:ziti.c:1307 edge_routers_cb() ztx[0] failed to get current edge routers: CONTROLLER_UNAVAILABLE/unknown node or service
(8)[2023-12-13T12:25:53.845Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:53.845Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:53.845Z]   ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]
(8)[2023-12-13T12:26:03.847Z]   ERROR tlsuv:tcp_src.c:113 connect failed: -3008(unknown node or service)
(8)[2023-12-13T12:26:03.847Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:26:03.847Z]   ERROR ziti-sdk:ziti.c:1307 edge_routers_cb() ztx[0] failed to get current edge routers: CONTROLLER_UNAVAILABLE/unknown node or service
(8)[2023-12-13T12:26:03.847Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:26:03.847Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:26:03.847Z]   ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]
(8)[2023-12-13T12:26:14.039Z]    INFO tunnel-cbs:ziti_tunnel_ctrl.c:767 on_ziti_event() ziti_ctx[k8s213] connected to controller
(8)[2023-12-13T12:26:14.039Z]    INFO ziti-edge-tunnel:ziti-edge-tunnel.c:1147 on_event() ztx[/ziti-edge-tunnel/ziti-run-node-ziti-edge-tunnel-identity.json] context event : status is OK

The interesting part is that this behavior begins only after changing the CoreDNS file according to the instructions.

It seems that the connection to the controller is lost again and again, but why?

Does anyone have a hint?


Do I understand correctly?

Your CoreDNS Corefile sets a forwarding rule for the mariadb.service zone like this:

                mariadb.service:53 {
                    errors
                    cache 30
                    forward . 100.64.0.2
                }

The meaning of mariadb.service is to forward any DNS record lookup within the mariadb.service zone, i.e. the anchor record @ and all labels subordinate to mariadb.service to the tunneler's nameserver on the current node. You confirmed there's one node per cluster in this PoC architecture, so there's no need for a NodeLocal DNSCache in this particular case.

Importantly, the daemonset is not configured to answer its own DNS queries because, though it may inherit host DNS and use CoreDNS, the DNS records that it queries do not match the forwarded zone in the Corefile.

You're seeing successful interception by all pods on all worker nodes where the daemonset is scheduled when they request DNS records like *.mariadb.service.

Finally, the remaining potential issue is frequent error logs emitted by the tunneler, but only when the forwarding zone rule is in effect.

Possible causes:

  • Is there another, more specific, zone specified in the Corefile that matches the DNS record of the Ziti controller?
  • What is the daemonset's dnsConfig? Do the symptoms change if a different DNS policy is employed?
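For illustration, a pod-spec fragment like the following would pin the daemonset's own DNS away from cluster DNS entirely (a sketch; dnsPolicy and dnsConfig are standard Kubernetes pod-spec fields, the nameserver value is only an example):

```yaml
# Illustrative daemonset pod-spec fragment: give the tunneler its own
# resolver so its controller lookups don't depend on CoreDNS at all.
spec:
  template:
    spec:
      dnsPolicy: "None"      # ignore both cluster and node DNS defaults
      dnsConfig:
        nameservers:
          - 8.8.8.8          # example upstream resolver; substitute your own
```

If the symptoms disappear with a configuration like this, that would point at the daemonset's DNS path rather than the Ziti components themselves.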

doc about using the privileged tunneler as a node proxy (a copy of the ziti-edge-tunnel Helm chart's README)

Hello again,

Thanks for the answer! You have correctly understood our setup.

Based on your first comment about possible causes, we tried to make the original CoreDNS configuration more specific and changed the default zone rule from forward . /etc/resolv.conf to forward . 8.8.8.8.
Essentially, it should have been equivalent: /etc/resolv.conf points to 127.0.0.53, which forwards queries to systemd-resolved on Ubuntu (if I am not mistaken). For some reason I cannot understand, this change works.

That is, the following configuration does not show any problems:

apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 8.8.8.8
        cache 30
        loop
        reload
        loadbalance
        mariadb.service:53 {
            forward . 100.64.0.2
        }
    }

kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system

Do you understand why that might be?

edit: also note that adding a separate ConfigMap does not seem to work either. Not sure why...

Regarding separate ConfigMaps for CoreDNS: some K8s distros, like K3s, merge additional ConfigMaps to achieve configuration "include" functionality. That way, the main Corefile can remain unmodified while injecting Corefile changes as "included" ConfigMaps. It's been a minute since I used that feature, but I remember finding it useful. Your K8s distro might have such a feature too.
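If memory serves, in K3s this takes the form of an optional coredns-custom ConfigMap in kube-system whose *.server keys are imported as extra server blocks; a sketch (verify the exact ConfigMap name and key suffixes against your distro's documentation):

```yaml
# Sketch of a K3s-style Corefile "include" ConfigMap. The name
# "coredns-custom" and the ".server" key suffix follow K3s's convention,
# if memory serves -- check your distro's docs before relying on this.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  mariadb.server: |
    mariadb.service:53 {
        forward . 100.64.0.2
    }
```

The advantage is that upgrades to the distro's bundled CoreDNS don't clobber your zone rule, since the main Corefile stays untouched.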

Regarding resolved: I take it that the node's OS is Ubuntu using systemd-resolved in the default configuration, which is probably the stub resolver, with /etc/resolv.conf being a symlink to the resolved-managed configuration file, like this:

❯ ll /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Oct 23 14:17 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf

I think you're saying that, in the default Corefile with forward . /etc/resolv.conf, errors like these are emitted by ziti-edge-tunnel run, but things function normally in that default configuration despite the concerning ERROR messages.

(8)[2023-12-13T12:26:03.847Z]   ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:26:03.847Z]   ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]

However, when you switch the default zone forward rule to Google's recursive nameserver 8.8.8.8, the ERROR messages are not emitted, and Ziti interception and name resolution continue to work normally.


It's also possible you were saying that things don't work unless you forward to 8.8.8.8. In that case, please share the terminal transcript from running these diagnostic commands, redacting at will.

ls -l /etc/resolv.conf
cat /etc/resolv.conf
resolvectl dns
resolvectl domain
ip addr sh ziti0