ERROR transport/v2/tls.(*sharedListener).processConn [tls:0.0.0.0:1280]: {remote=[172.20.0.4:38382] error=[tls: client didn't provide a certificate]} handshake failed

If you do end up troubleshooting on Oracle, my particular VM is running Canonical Ubuntu 22.04 on their VM.Standard.A1.Flex shape.

Nice going. Those details will arm me for another troubleshooting pass. Now we know it is not only the ziti-controller that cannot be reached: the zrok-frontend cannot reach the zrok-controller either, so the problem is not unique to the ziti-controller.

I could not reproduce the problem with an arm64 Ubuntu Noble 24.04 instance on Oracle cloud.

Here is a boring asciinema recording I made when I tried to reproduce the problem with an arm64 EC2 instance, also Ubuntu Noble 24.04. It was too big to upload to the web player site, so you will need to install asciinema and run asciinema play https://n80x0r2fmjyr.bingnet.cloud/ken-ec2-arm64-zrok.json. I'll leave that zrok share available over the long US weekend (assuming my computer stays running).
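If you haven't used asciinema before, it's in the Ubuntu repos, and the player can read the recording straight from that URL, so it's roughly:

$ sudo apt install asciinema
$ asciinema play https://n80x0r2fmjyr.bingnet.cloud/ken-ec2-arm64-zrok.json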

The recording might show me doing something different. I'm not sure what it could be though.

Did you install regular ol' Docker on your arm64 VPS from these instructions?
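By "regular ol' Docker" I mean Docker Engine installed per Docker's own docs, as opposed to a Snap or distro repackaging; e.g., the convenience-script route is:

$ curl -fsSL https://get.docker.com | sudo sh
$ sudo docker version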

...and I just noticed you clarified you were using Ubuntu Jammy 22.04, not Noble. I'm pretty confident the underlying problem doesn't stem from that difference, but I've been wrong before.

FYI: the AWS session credentials in that recording are expired.

Thanks for the recording. Are you actually editing anything in this config (or any file other than the .env)? It looks like asciinema gets all funny anytime you're in Vim.

It is just a regular Docker install, yeah: 27.2.0.

Some more info in case it helps:

If I go into ziti-quickstart with docker compose exec -it ziti-quickstart bash, the prompt shows the user as "I have no name!"
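In case it matters, that seems to be what bash shows when the current UID has no /etc/passwd entry inside the image; the numeric IDs it's actually running as are visible with:

$ docker compose exec ziti-quickstart id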

And for some reason I thought I included this, but this error is also in the quickstart logs:

 ERROR ziti/tunnel/dns.NewDnsServer: system resolver test failed: failed to resolve ziti-tunnel.resolver.test: lookup ziti-tunnel.resolver.test on 127.0.0.11:53: no such host
ziti-quickstart-1  |
ziti-quickstart-1  | ziti-tunnel runs an internal DNS server which must be first in the host's
ziti-quickstart-1  | resolver configuration. On systems that use NetManager/dhclient, this can
ziti-quickstart-1  | be achieved by adding the following to /etc/dhcp/dhclient.conf:
ziti-quickstart-1  |
ziti-quickstart-1  |     prepend domain-name-servers 127.0.0.1:53;

I did have that line added to my server's /etc/dhcp/dhclient.conf, but does it need to go somewhere else? The logs mentioning 127.0.0.11:53 instead of 127.0.0.1:53 do seem a bit funny to me, but I'm wondering if the logger is just typoing an extra 1, based on the later message below.

I changed the Caddyfile's tls section to match the documentation for the route53 plugin, which uses three values for the credential. There's no need to edit the Caddyfile if your DNS provider uses a single API token for the credential, as Cloudflare does. I used the Cloudflare plugin before, so I know it works.
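From memory (the exact field names are worth double-checking against each plugin's README, and the env var names here are placeholders), the difference looks roughly like this. Cloudflare takes a single token:

    tls {
        dns cloudflare {env.CLOUDFLARE_API_TOKEN}
    }

whereas route53 takes the three AWS credential values:

    tls {
        dns route53 {
            access_key_id {env.AWS_ACCESS_KEY_ID}
            secret_access_key {env.AWS_SECRET_ACCESS_KEY}
            session_token {env.AWS_SESSION_TOKEN}
        }
    }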

This is harmless noise, not an error. There's no need to change your resolver configuration. We did confirm the zrok-controller was finding the correct container IP for the ziti-quickstart container.
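That 127.0.0.11 isn't a typo, by the way: it's Docker's embedded DNS server, which is what every container on a user-defined compose network uses as its resolver. You can see it from inside the container (output trimmed):

$ docker compose exec ziti-quickstart cat /etc/resolv.conf
nameserver 127.0.0.11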

Coincidentally, I fixed the bug that was causing this, and it should be included in :latest, which is :1.2.0. You may need to pull the image from Docker Hub: docker compose pull.
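To pick it up, pull the new image and then recreate the containers on it:

$ docker compose pull
$ docker compose up -d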

I haven't been able to think of a condition that explains why it can look up the correct IP but gets a connection refused (RST) response when it attempts to connect to 1280/tcp.
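If you feel like poking at it, a probe from inside the zrok-controller container would at least show whether the refusal also happens on the compose network (assuming curl is present in that image; the service name here is from the quickstart compose file):

$ docker compose exec zrok-controller curl -skv https://ziti-quickstart:1280/edge/client/v1/version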

That's disappointing. If you think it would help, I'll narrate the recording I made of the Oracle arm64 install. I might do it anyway, because I'm half expecting an aha moment any time now, and I bet it's going to be something simple.

Here's the raw asciinema from the Oracle arm64 install: https://n80x0r2fmjyr.bingnet.cloud/oracle-arm64-docker-zrok.json

Oh, man. I just tried to watch this one, and something's definitely not right when I'm in the editor. It's like chunks of the viewport are missing.

Is this info helpful at all?

$ docker inspect zrok-ziti-quickstart-1 | grep IPAddress
172.20.0.3 # ip is different after a wipe (or two, or ten, whatever, let's not dwell on this one)
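Side note: the same value without the grep, using inspect's template flag:

$ docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' zrok-ziti-quickstart-1
172.20.0.3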

$ docker compose exec -it zrok-controller bash

bash-5.1$ ziti edge login

Enter controller host[:port] (default localhost:1280): 172.20.0.3:1280

Untrusted certificate authority retrieved from server
Verified that server supplied certificates are trusted by server
Server supplied 2 certificates
Trust server provided certificate authority [Y/N]: Y

RESTY 2024/11/09 22:20:07 ERROR Get "https://172.20.0.3:1280/edge/client/v1/version": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, ::1, not 172.20.0.3, Attempt 1
... multiple repeats

So it seems zrok-controller can at least get a different error if I manually try the login; I only get the TCP dial issue when I enter 172.20.0.3 without the port number, or when accepting the default:

Connection error: Get https://localhost:1280/.well-known/est/cacerts: dial tcp [::1]:1280: connect: connection refused
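For completeness: logging in with the advertised DNS name from the error below, which evidently does resolve to 172.20.0.3 inside the container, should avoid the SAN mismatch, assuming the quickstart issued the controller cert for that name:

bash-5.1$ ziti edge login ziti.zrok.[mydomain].dev:1280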

I suppose that makes it all the more puzzling why the logs still show the error below and not a different one, though I imagine you already knew that from some of the other diagnostics above.

$ docker compose logs zrok-controller
zrok-controller-1  | {"file":"/home/runner/work/zrok/zrok/controller/bootstrap.go:34","func":"github.com/openziti/zrok/controller.Bootstrap","level":"info","msg":"connecting to the ziti edge management api","time":"2024-11-09T03:08:25.889Z"}
zrok-controller-1  | panic: error connecting to the ziti edge management api: Get "https://ziti.zrok.[mydomain].dev:1280/edge/management/v1/.well-known/est/cacerts": dial tcp 172.20.0.3:1280: connect: connection refused

So I don't know if I missed something in either the tutorial videos or the docs, but my aha moment was: what if it's just a race condition and the services aren't auto-retrying? Restarting the controller and frontend seems to have done the trick (previously I'd always restarted all three at the same time, not realizing the quickstart errors were non-issues). Thanks for the help!
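For anyone landing here later, the restart was just:

$ docker compose restart zrok-controller zrok-frontend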

I've submitted a PR in case it's something that will help others in the future.


Declaring that dependency to order the container startup is a good call. Thank you. I had naively thought the restart: unless-stopped policy on zrok-controller and zrok-frontend would lead to them eventually starting up, but you may have discovered a condition where that's untrue.
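For anyone reading along before that PR lands, the ordering constraint is only a few lines of compose YAML. A sketch, with service names assumed from the quickstart (service_healthy would additionally require a healthcheck defined on ziti-quickstart):

    services:
      zrok-controller:
        depends_on:
          ziti-quickstart:
            condition: service_started   # or service_healthy, with a healthcheck
      zrok-frontend:
        depends_on:
          zrok-controller:
            condition: service_started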