We are trying a PoC that uses OpenZiti to let two pods in different k8s clusters communicate. Each k8s cluster has only one node. We are using the Helm charts, specifically ziti-edge-tunnel.
While we are able to establish a connection between the two pods, we see a strange behavior.
Domain names of the Ziti service are not always resolved. For example, from inside a pod, running curl http://mariadb.service:3306 resolves about half of the time; the other half it reports that it cannot resolve the hostname. Note that this problem does not occur when running the same curl command from the host itself (i.e., not in a pod).
CoreDNS was configured as in the instructions, except that we changed your.ziti.domain to mariadb.service. We tried both 100.64.0.2 (as in the instructions) and 100.64.0.1 (the Ziti-created interface on the host). Both behave the same way.
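To put a number on "almost half of the time", a small script can repeat the lookup from inside the pod and count failures. This is only a sketch: the function name resolution_failure_rate and the default of 20 attempts are illustrative choices, not part of any Ziti tooling, and it assumes Python is available in the pod image.

```python
import socket


def resolution_failure_rate(hostname, attempts=20, resolver=socket.getaddrinfo):
    """Resolve `hostname` repeatedly and return (failures, attempts).

    `resolver` defaults to the system resolver; it is a parameter only so the
    function can be exercised without real DNS.
    """
    failures = 0
    for _ in range(attempts):
        try:
            resolver(hostname, None)
        except socket.gaierror:
            # Raised when the name cannot be resolved.
            failures += 1
    return failures, attempts
```

Calling resolution_failure_rate("mariadb.service") in the pod and again on the host would make the difference between the two environments easy to compare.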
Any thoughts?
edit: The logs of the ziti-edge-tunnel also show various errors.
On the bind side, it seems to be losing its connection to the controller and restarting:
(8)[2023-12-12T14:15:43.605Z] ERROR tlsuv:tcp_src.c:113 connect failed: -3008(unknown node or service)
(8)[2023-12-12T14:15:43.605Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-12T14:15:43.605Z] ERROR ziti-sdk:ziti.c:1307 edge_routers_cb() ztx[0] failed to get current edge routers: CONTROLLER_UNAVAILABLE/unknown node or service
(8)[2023-12-12T14:15:43.605Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-12T14:15:43.605Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-12T14:15:43.605Z] ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]
(8)[2023-12-12T14:15:43.605Z] WARN tunnel-cbs:ziti_tunnel_ctrl.c:781 on_ziti_event() ziti_ctx controller connections failed: ziti controller is not available
(8)[2023-12-12T14:15:43.605Z] INFO ziti-edge-tunnel:ziti-edge-tunnel.c:1147 on_event() ztx[/ziti-edge-tunnel/ziti-run-node-ziti-edge-tunnel-identity.json] context event : status is ziti controller is not available
(8)[2023-12-12T14:15:43.605Z] ERROR ziti-edge-tunnel:ziti-edge-tunnel.c:1202 on_event() ztx[/ziti-edge-tunnel/ziti-run-node-ziti-edge-tunnel-identity.json] failed to connect to controller due to ziti controller is not available
(8)[2023-12-12T14:15:53.788Z] INFO tunnel-cbs:ziti_tunnel_ctrl.c:767 on_ziti_event() ziti_ctx[ocm-1] connected to controller
(8)[2023-12-12T14:15:53.788Z] INFO ziti-edge-tunnel:ziti-edge-tunnel.c:1147 on_event() ztx[/ziti-edge-tunnel/ziti-run-node-ziti-edge-tunnel-identity.json] context event : status is OK
On the dial side, this error appears periodically for each service defined in Ziti, regardless of policy:
ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-12T14:12:37.175Z] WARN ziti-sdk:bind.c:213 session_cb() server[0.230] failed to get session for service[mariadb_service]: -16/CONTROLLER_UNAVAILABLE
Maybe you should disregard the logs above, because we cannot reproduce them.
We did a completely clean installation twice, and the behavior is the same as described before (nslookup fails some of the time). However, the logs from the bind and dial sides are now the following:
Bind: no errors
Dial: We get more or less the following in a repeating fashion:
(8)[2023-12-13T12:25:23.840Z] ERROR tlsuv:tcp_src.c:113 connect failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:23.840Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:23.840Z] ERROR ziti-sdk:ziti.c:1307 edge_routers_cb() ztx[0] failed to get current edge routers: CONTROLLER_UNAVAILABLE/unknown node or service
(8)[2023-12-13T12:25:23.840Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:23.840Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:23.840Z] ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]
(8)[2023-12-13T12:25:23.840Z] WARN tunnel-cbs:ziti_tunnel_ctrl.c:781 on_ziti_event() ziti_ctx controller connections failed: ziti controller is not available
(8)[2023-12-13T12:25:23.840Z] INFO ziti-edge-tunnel:ziti-edge-tunnel.c:1147 on_event() ztx[/ziti-edge-tunnel/ziti-run-node-ziti-edge-tunnel-identity.json] context event : status is ziti controller is not available
(8)[2023-12-13T12:25:23.840Z] ERROR ziti-edge-tunnel:ziti-edge-tunnel.c:1202 on_event() ztx[/ziti-edge-tunnel/ziti-run-node-ziti-edge-tunnel-identity.json] failed to connect to controller due to ziti controller is not available
(8)[2023-12-13T12:25:33.842Z] ERROR tlsuv:tcp_src.c:113 connect failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:33.842Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:33.842Z] ERROR ziti-sdk:ziti.c:1307 edge_routers_cb() ztx[0] failed to get current edge routers: CONTROLLER_UNAVAILABLE/unknown node or service
(8)[2023-12-13T12:25:33.842Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:33.842Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:33.842Z] ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]
(8)[2023-12-13T12:25:39.106Z] INFO tunnel-cbs:ziti_dns.c:500 format_resp() found record[100.64.0.3] for query[1:mariadb.service]
(8)[2023-12-13T12:25:43.844Z] ERROR tlsuv:tcp_src.c:113 connect failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:43.844Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:43.844Z] ERROR ziti-sdk:ziti.c:1307 edge_routers_cb() ztx[0] failed to get current edge routers: CONTROLLER_UNAVAILABLE/unknown node or service
(8)[2023-12-13T12:25:43.844Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:43.844Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:43.844Z] ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]
(8)[2023-12-13T12:25:53.845Z] ERROR tlsuv:tcp_src.c:113 connect failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:53.845Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:53.845Z] ERROR ziti-sdk:ziti.c:1307 edge_routers_cb() ztx[0] failed to get current edge routers: CONTROLLER_UNAVAILABLE/unknown node or service
(8)[2023-12-13T12:25:53.845Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:53.845Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:25:53.845Z] ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]
(8)[2023-12-13T12:26:03.847Z] ERROR tlsuv:tcp_src.c:113 connect failed: -3008(unknown node or service)
(8)[2023-12-13T12:26:03.847Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:26:03.847Z] ERROR ziti-sdk:ziti.c:1307 edge_routers_cb() ztx[0] failed to get current edge routers: CONTROLLER_UNAVAILABLE/unknown node or service
(8)[2023-12-13T12:26:03.847Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:26:03.847Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:26:03.847Z] ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]
(8)[2023-12-13T12:26:14.039Z] INFO tunnel-cbs:ziti_tunnel_ctrl.c:767 on_ziti_event() ziti_ctx[k8s213] connected to controller
(8)[2023-12-13T12:26:14.039Z] INFO ziti-edge-tunnel:ziti-edge-tunnel.c:1147 on_event() ztx[/ziti-edge-tunnel/ziti-run-node-ziti-edge-tunnel-identity.json] context event : status is OK
The interesting part is that this behavior begins only after changing the CoreDNS configuration according to the instructions.
It seems that the connection to the controller is lost again and again, but why?
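One detail that may help when reading these logs: -3008 is libuv's UV_EAI_NONAME code ("unknown node or service"), which is a hostname lookup failure. So each burst likely means the tunneler could not resolve the controller's hostname, rather than an established connection being dropped. A rough Python sketch for measuring how long each such window lasts, using only the timestamps in the log; the function name outage_windows is hypothetical and the matching strings are taken from the lines pasted above.

```python
import re
from datetime import datetime

# Matches "[2023-12-13T12:25:23.840Z] LEVEL message..." as seen in the logs above.
LINE_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\]\s+\w+\s+(.*)"
)


def outage_windows(log_lines):
    """Return a list of (start, end, seconds) tuples.

    A window opens at the first failure line and closes at the next
    "connected to controller" line. Lines without a timestamp are skipped.
    """
    windows = []
    start = None
    for line in log_lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        message = match.group(2)
        if "connected to controller" in message:
            if start is not None:
                windows.append((start, stamp, (stamp - start).total_seconds()))
                start = None
        elif "-3008" in message or "CONTROLLER_UNAVAILABLE" in message:
            if start is None:
                start = stamp
    return windows
```

Feeding it the dial-side log (for example, the lines of a saved log file) gives the length of each period during which the controller name could not be resolved.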
The mariadb.service zone in the Corefile means: forward any DNS record lookup within the mariadb.service zone, i.e., the anchor record @ and all labels subordinate to mariadb.service, to the tunneler's nameserver on the current node. You confirmed there's one node per cluster in this PoC architecture, so there's no need for a NodeLocal DNSCache in this particular case.
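The zone matching described above can be sketched in a few lines of Python. This is a simplified model of how a query name is matched against Corefile zones (most specific zone wins), not CoreDNS's actual implementation; the function names and the hostname ctrl.example.com are made up for illustration.

```python
def zone_matches(qname, zone):
    """True if qname is the zone's apex or a name subordinate to it."""
    q = qname.lower().rstrip(".")
    z = zone.lower().rstrip(".")
    if z == "":
        return True  # the root zone "." matches every name
    return q == z or q.endswith("." + z)


def label_count(zone):
    """Number of labels in a zone name; the root zone has zero."""
    z = zone.rstrip(".")
    return len(z.split(".")) if z else 0


def select_zone(qname, zones):
    """Pick the most specific (longest) matching zone, or None."""
    candidates = [z for z in zones if zone_matches(qname, z)]
    if not candidates:
        return None
    return max(candidates, key=label_count)
```

With zones [".", "mariadb.service"], a query for mariadb.service or db.mariadb.service selects the mariadb.service zone, while ctrl.example.com falls through to the default "." zone. If some other zone in the Corefile also matched the controller's name, that zone would be selected for it instead.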
Importantly, the daemonset is not configured to answer its own DNS queries because, though it may inherit host DNS and use CoreDNS, the DNS records that it queries do not match the forwarded zone in the Corefile.
You're seeing successful interception by all pods on all worker nodes where the daemonset is scheduled when they request DNS records like *mariadb.service.
Finally, the remaining potential issue is frequent error logs emitted by the tunneler, but only when the forwarding zone rule is in effect.
Possible causes:
Is there another, more specific, zone specified in the Corefile that matches the DNS record of the Ziti controller?
What is the daemonset's dnsConfig? Do the symptoms change if a different DNS policy is employed?
Thanks for the answer! You have correctly understood our setup.
Based on your first comment about possible causes, we tried to make the original CoreDNS configuration more specific and changed it from forward . /etc/resolv.conf to forward . 8.8.8.8.
In principle, the two should have been equivalent: our /etc/resolv.conf points to 127.0.0.53, which forwards to systemd-resolved on Ubuntu (if I am not mistaken). For some reason I cannot understand, this solution works.
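One thing that might be worth checking here is which nameservers the resolv.conf seen by CoreDNS actually lists, and whether any of them is a loopback address such as 127.0.0.53 (a loopback address refers to whichever network namespace the process runs in). A minimal Python sketch, assuming the file contents are read separately; the function names are illustrative.

```python
import ipaddress


def parse_nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Comment lines start with '#' or ';' and are skipped by this check.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def loopback_nameservers(resolv_conf_text):
    """Return only the nameservers that are loopback addresses."""
    result = []
    for server in parse_nameservers(resolv_conf_text):
        try:
            if ipaddress.ip_address(server).is_loopback:
                result.append(server)
        except ValueError:
            continue  # not a plain IP address; ignore in this sketch
    return result
```

For example, loopback_nameservers(open("/etc/resolv.conf").read()) could be run both on the node and inside a pod to compare what each one sees.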
I.e., using this configuration does not show any problems:
Regarding separate ConfigMaps for CoreDNS: some K8s distros, like K3s, merge additional ConfigMaps to achieve configuration "include" functionality. That way, the main Corefile can remain unmodified while Corefile changes are injected as "included" ConfigMaps. It's been a minute since I used that feature, but I remember finding it useful. Your K8s distro might have such a feature too.
Regarding resolved: I take it that the node's OS is Ubuntu, using systemd-resolved in the default configuration, which is probably stub-resolver mode with /etc/resolv.conf being a symlink to the resolved-managed configuration file like this.
I think you're saying that, in the default Corefile with forward . /etc/resolv.conf, errors like these are emitted by ziti-edge-tunnel run, but things function normally in that default configuration despite the concerning ERROR messages.
(8)[2023-12-13T12:26:03.847Z] ERROR ziti-sdk:ziti_ctrl.c:162 ctrl_resp_cb() ctrl[*REDACTED*] request failed: -3008(unknown node or service)
(8)[2023-12-13T12:26:03.847Z] ERROR ziti-sdk:ziti.c:1099 update_services() ztx[0] failed to get service updates err[CONTROLLER_UNAVAILABLE/unknown node or service] from ctrl[https://*REDACTED*:8441/edge/client/v1]
However, when you switch the default zone forward rule to Google's recursive nameserver 8.8.8.8, the ERROR messages are not emitted, and Ziti interception and name resolution continue to work normally.
It's also possible you were saying that things don't work unless you forward to 8.8.8.8. In that case, please share the terminal transcript from running these diagnostic commands, redacting at will.
ls -l /etc/resolv.conf
cat /etc/resolv.conf
resolvectl dns
resolvectl domain
ip addr sh ziti0