I have given Homarr, running in Kubernetes, a Ziti identity.
In the pod, I can successfully resolve a Ziti service via dig or curl.
From what I understand, Next.js uses ICMP to resolve hostnames, so Homarr throws this error:
2025-03-25T20:08:21.236Z info: Dispatching request https://auth.domain.com/realms/services/.well-known/openid-configuration (5 headers)
2025-03-25T20:08:21.276Z error: TypeError: fetch failed
at e.exports.hd (/app/apps/nextjs/.next/server/chunks/8287.js:1:129739)
at async o1 (/app/apps/nextjs/.next/server/chunks/8287.js:489:50086)
at async o3 (/app/apps/nextjs/.next/server/chunks/8287.js:489:52860)
at async o8 (/app/apps/nextjs/.next/server/chunks/8287.js:489:55276)
at async ae (/app/apps/nextjs/.next/server/chunks/8287.js:489:57629)
at async d (/app/apps/nextjs/.next/server/app/api/auth/[...nextauth]/route.js:1:3658)
at async tr.do (/app/node_modules/next/dist/compiled/next-server/app-route.runtime.prod.js:18:17582)
at async tr.handle (/app/node_modules/next/dist/compiled/next-server/app-route.runtime.prod.js:18:22212)
at async doRender (/app/node_modules/next/dist/server/base-server.js:1452:42)
The above error also had these properties on it:
{
cause: [Error: getaddrinfo ENOTFOUND auth.domain.com] {
errno: -3008,
code: 'ENOTFOUND',
syscall: 'getaddrinfo',
hostname: 'auth.domain.com'
}
}
2025-03-25T20:08:21.276Z error: Error: getaddrinfo ENOTFOUND auth.domain.com
The above error also had these properties on it:
{
errno: -3008,
code: 'ENOTFOUND',
syscall: 'getaddrinfo',
hostname: 'auth.domain.com'
}
This looks to me like "auth.domain.com" simply doesn't exist within the Homarr pod. I've never heard of ICMP being used to resolve hostnames, but getaddrinfo is the standard libc call for resolving a hostname to an IP.
Is auth.domain.com supposed to be a Ziti intercept? Should the identity be able to intercept that address? Or is this just some random Kubernetes service resolution failure? It's not quite clear to me from your post which of those you expect.
Interesting. So if getaddrinfo fails from Next.js, the question would be "what DNS server does the app use?" Is this a possible race condition of some kind, where the tunneler needs to be up and running before Homarr so that it doesn't end up failing like this?
I'm not entirely sure how to reproduce this particular issue. I'll ask around and see if anyone else has a thought.
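If startup ordering does turn out to matter, one way to take it out of the picture would be to run the tunneler as a native sidecar so it is guaranteed to start before Homarr. This is only a rough sketch under assumptions: it assumes Kubernetes 1.29+ with the SidecarContainers feature, and the image tag, args, and omitted identity volume are placeholders rather than a verified configuration.

apiVersion: v1
kind: Pod
metadata:
  name: homarr
spec:
  initContainers:
    - name: ziti-tunnel
      image: openziti/ziti-tunnel   # placeholder tag
      args: ["tproxy"]              # assumption: whatever args you pass today
      restartPolicy: Always         # marks this init container as a native sidecar
      securityContext:
        capabilities:
          add: ["NET_ADMIN"]        # tproxy needs to manage iptables rules
      # identity volume mount omitted for brevity
  containers:
    - name: homarr
      image: homarr/image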
From what I can tell, Next.js uses the C standard library resolver under the bonnet.
Actually, I have a really clever liveness probe configured that runs curl against a separate Ziti host, so there must be a live connection before the pod is considered live.
I have the tunneler running as a sidecar with the tproxy argument.
I also just upgraded from version 1.4.2 to the latest 1.5.0, and the behaviour is the same.
One thing that you should be able to replicate is the domain failing to resolve.
Homarr uses a Node Docker image.
I would say to do this:
make a minimal Docker Compose file with that Node image and the ziti-tunnel image
give it access to a Ziti service
run apk add drill curl
run drill domain.com a bunch of times
It will resolve most of the time, and occasionally fail.
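Something like this Compose file should be enough to reproduce it. This is just a sketch under assumptions: the images, identity path, tunneler arguments, and the test hostname are placeholders to adapt to however your tunneler is normally configured.

services:
  ziti-tunnel:
    image: openziti/ziti-tunnel           # placeholder tag
    command: ["tproxy"]                   # assumption: same mode/args used in the k8s sidecar
    volumes:
      - ./identity.json:/identity.json    # placeholder identity location
    cap_add:
      - NET_ADMIN                         # tproxy needs to manage iptables rules
  app:
    image: node:alpine                    # stand-in for the Homarr base image
    network_mode: "service:ziti-tunnel"   # share the tunneler's network namespace
    depends_on:
      - ziti-tunnel
    command:
      - sh
      - -c
      - |
        apk add --no-cache drill curl
        while true; do drill domain.com; sleep 1; done

Depending on how the tunneler publishes its nameserver inside the shared network namespace, you may also need to point the app container's resolver at it (for example via an /etc/resolv.conf override).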
This server address is probably the default pod resolver provided by CoreDNS. It indicates that the pod where you ran drill auth.domain.com is correctly configured with both the Ziti and default nameservers, and that the query fell back to CoreDNS for some reason, probably determined by resolver logic inside the Next.js application.
The commands I ran were in a bash terminal (shelled into the pod), not part of the nextjs program.
I did try running dig instead of drill in a loop, and curiously, only drill shows the intermittent failure.
You confirmed the tproxy sidecar's identity has dial permission for auth.domain.com, and you observed at least one failed DNS query for that name.
I'd expect DNS to begin working after the tproxy sidecar has obtained the list of authorized Ziti services it may dial, which may take a few moments after starting up.
As I mentioned earlier, I have a liveness probe on the ziti-tunnel sidecar that checks a ziti service.
This makes certain that the services assigned can be dialled.
It does not matter how long I wait after the pod is up.
Thanks for the troubleshooting assistance! When you override DNS with a host entry, does 192.168.1.xxx from your example represent that server's ClusterIP, pod IP, Ziti intercept address, or something else?
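I'm assuming by "host entry" you mean something like a hostAliases block in the pod spec, or an equivalent /etc/hosts line, which getaddrinfo consults before DNS. A sketch, reusing the placeholder address from your example:

spec:
  hostAliases:
    - ip: "192.168.1.xxx"      # placeholder from the example above
      hostnames:
        - "auth.domain.com"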
I've been working recently and extensively with ziti tunnel tproxy as a Kubernetes pod sidecar, i.e., a helper container providing a two-way proxy for the main application container using the openziti/ziti-tunnel container image.
My primary functional testing technique involves a busybox container (uses glibc, not musl) running an HTTP request loop that must resolve an address with Ziti DNS to succeed. I haven't instrumented that test yet and have not yet noticed any sporadic failures in casual spot checks of the results.
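For reference, that test is roughly the following sketch; the image variant, intercept hostname, and timing are placeholders, and the ziti-tunnel sidecar is omitted for brevity.

apiVersion: v1
kind: Pod
metadata:
  name: ziti-dns-check
spec:
  containers:
    - name: http-loop
      image: busybox:glibc        # glibc variant, per the note above
      command:
        - sh
        - -c
        - |
          # wget resolves the name via the libc resolver, so every request exercises Ziti DNS
          while true; do
            wget -q -O /dev/null http://testservice.ziti || echo "request failed at $(date)"
            sleep 1
          done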
From the perspective of the application issuing DNS queries against the Ziti nameserver provided by ziti tunnel tproxy, there is only the call to the OS resolver requesting an address for the intercept domain, and the pod specification determines how the OS handles timeouts, retries, search domains, etc. You can set the resolver options documented in resolv.conf(5) (the Linux manual page). The libc getaddrinfo call will obey these parameters when the Ziti intercept address is queried. For example:
apiVersion: v1
kind: Pod
metadata:
  name: homarr
spec:
  dnsPolicy: None
  dnsConfig:
    nameservers:
      - "127.0.0.1"
      - "10.43.0.10"
    options:
      - name: timeout
        value: "5"   # Set timeout to 5 seconds (default is usually 5)
      - name: attempts
        value: "5"   # Set number of attempts to 5 (default is usually 2)
    # You can also configure search domains and options if needed
  containers:
    - name: homarr
      image: homarr/image
    - name: ziti-tunnel
      image: openziti/ziti-tunnel
      ports:
        - containerPort: 53
          protocol: UDP
        - containerPort: 53
          protocol: TCP
Will you please say more about "a separate ziti host"? Does that mean it's not a kubelet-initiated liveness probe, but more of a custom health check initiated by another Ziti identity against the same intercept address, for the purpose of verifying that the Ziti service is healthy separately from the pod's DNS and proxy?
Also, please let me know if any of the failures you observed happen to be using Alpine (musl, not glibc), because there are some known issues with DNS that may be resolved with configuration or upgrading Alpine (Debugging DNS Resolution | Kubernetes).
Homarr is using Alpine 3.21, and that known DNS issue was fixed in 3.18, so it shouldn't apply here.
The liveness probe is configured on the Homarr pod, which has a Ziti sidecar loading an identity with dial permission for two services: auth.domain.com:80 and alwaysup.domain.com:80. The probe continually checks that the "alwaysup" service is still responding OK, and the kubelet restarts the container if the liveness probe fails.
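For context, the probe looks roughly like this sketch; which container it is attached to, the curl flags, and the timings are approximations of my setup rather than an exact copy.

containers:
  - name: homarr
    image: homarr/image
    livenessProbe:
      exec:
        command: ["curl", "--fail", "--silent", "http://alwaysup.domain.com"]
      periodSeconds: 10
      failureThreshold: 3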
OK. I expect dig and drill to implement DNS directly rather than relying on OS calls like getaddrinfo, so I don't expect these troubleshooting tools to represent precisely the behavior of a client application that does call getaddrinfo. Still, those tools should self-configure based on the current list of nameservers, so you should be able to verify Ziti DNS with dig or drill in most cases, though a fast query loop won't have the same failure modes as the glibc resolver unless the troubleshooting tool (like dig) happens to be specially configured to emulate it.
Notably, dig did not manifest the problem, while drill did, intermittently, when hammering the nameserver, so I attribute that to differences in how the two tools interpret whatever SERVFAIL, NXDOMAIN, or timeout they may have encountered when querying the Ziti nameserver.
After the pod is up, as confirmed by the liveness probe, you continue to observe intermittent Ziti DNS failures where the Homarr server pod is unable to look up auth.domain.com. Is that an accurate capture of the problem you're solving?
You confirmed the host entry points to an instance of Traefik. I assume that instance has a host rule for auth.domain.com, so this test effectively bypasses Ziti to isolate the cause, right?
When testing with Outline, is the failure mode also accurately described as "intermittent Ziti DNS failures where the Outline pod is unable to look up auth.domain.com"?
That is strange. The Homarr Next.js application cannot resolve auth.domain.com at all, not even intermittently, but curl exec'd by the kubelet inside the same container (Alpine with musl) can resolve the other Ziti service address, alwaysup.domain.com, most of the time, with the intermittent failures signaled by the liveness probe.
I am sure the ziti tunnel sidecar log would show it discovering both Ziti services because you can resolve both addresses with dig and drill.
Does the ziti tunnel sidecar log contain any interesting error messages?