I think quickstart:0.30.1 broke something in the edge-tunnel connections.
I backed up my docker compose volumes and rebuilt with the 0.30.1 as suggested, but I could not get even the most basic service to work. I hammered at it for HOURS... no joy. For a lark, I redirected my docker volumes and started my old one up to see if it worked. I had to re-enroll some identities, but I got it back and it was NOT working. So I tried specifying the quickstart:0.30.0 image and WHAMMY... it worked. So I changed it back again to the 0.30.1 and once again, no joy.
Here is the tunneler output on the 0.30.1 on a linux host that is trying to ssh into an identity.
Aug 25 20:54:27 pve1 systemd[1]: Started ziti-edge-tunnel.service - Ziti Edge Tunnel.
... ignoring my DNS errors unless you want to see it.
Aug 25 20:55:08 pve1 ziti-edge-tunnel[213980]: (213980)[ 40.745] ERROR ziti-sdk:channel.c:860 on_channel_connect_internal() ch[0] failed to connect to ER[ziti.mydomain.com] [-3008/unknown node or service]
Aug 25 20:55:22 pve1 ziti-edge-tunnel[213980]: (213980)[ 55.298] ERROR ziti-sdk:channel.c:860 on_channel_connect_internal() ch[0] failed to connect to ER[ziti.mydomain.com] [-3001/temporary failure]
Aug 25 20:55:50 pve1 ziti-edge-tunnel[213980]: (213980)[ 83.101] ERROR ziti-sdk:channel.c:860 on_channel_connect_internal() ch[0] failed to connect to ER[ziti.mydomain.com] [-3008/unknown node or service]
Aug 25 20:56:25 pve1 ziti-edge-tunnel[213980]: (213980)[ 118.466] ERROR ziti-sdk:channel.c:860 on_channel_connect_internal() ch[0] failed to connect to ER[ziti.mydomain.com] [-3001/temporary failure]
Aug 25 20:57:53 pve1 ziti-edge-tunnel[213980]: (213980)[ 205.899] ERROR ziti-sdk:channel.c:860 on_channel_connect_internal() ch[0] failed to connect to ER[ziti.mydomain.com] [-3001/temporary failure]
Aug 25 20:59:37 pve1 ziti-edge-tunnel[213980]: (213980)[ 310.089] ERROR ziti-sdk:channel.c:860 on_channel_connect_internal() ch[0] failed to connect to ER[ziti.mydomain.com] [-3001/temporary failure]
Aug 25 20:59:37 pve1 ziti-edge-tunnel[213980]: (213980)[ 310.089] ERROR ziti-sdk:connect.c:281 on_channel_connected() ztx[0] ch[0] failed to connect [-3001/temporary failure]
Aug 25 20:59:47 pve1 ziti-edge-tunnel[213980]: (213980)[ 320.086] WARN ziti-sdk:connect.c:332 connect_timeout() conn[0.2/Connecting] connect timeout: no suitable edge router
Aug 25 20:59:47 pve1 ziti-edge-tunnel[213980]: (213980)[ 320.086] ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: operation did not complete in time
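For what it's worth, -3008 and -3001 are the values libuv assigns to UV_EAI_NONAME and UV_EAI_AGAIN, which are name-resolution failures. So it may be worth checking whether the host running the tunneler can resolve the address the edge router advertises. Below is a minimal sketch in Python of that check. The helper name `resolve` is made up for illustration, and the port in the usage comment is an assumption, not something taken from the logs.

```python
import socket


def resolve(host: str, port: int) -> list[str]:
    """Return the sorted unique addresses getaddrinfo reports, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as err:
        # Same class of failure the tunneler logs as "unknown node or service"
        # or "temporary failure".
        print(f"{host}: name resolution failed: {err}")
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Usage (hypothetical values, substitute what your edge router advertises):
#   print(resolve("ziti.mydomain.com", 8442))
```

An empty list from the tunneler host would point at DNS rather than at the overlay itself.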
Here is the output with the tunneler back on 0.30.0. There is nothing of interest after it starts... services just work.
Aug 25 21:02:40 pve1 systemd[1]: Started ziti-edge-tunnel.service - Ziti Edge Tunnel.
... ignoring my DNS errors unless you want to see it.
nothing from here out... just works
Here is what I get if I try to connect to a device I haven't re-enrolled. The identity exists, but I am guessing this error means it can't find it to connect.
Aug 25 21:07:25 pve1 ziti-edge-tunnel[227246]: (227246)[ 284.076] ERROR ziti-sdk:connect.c:919 connect_reply_cb() conn[0.4/Connecting] failed to connect, reason=service 5PTrC1jpZkVtrrP8wat5fH has no terminators for instanceId pve2.jp
Aug 25 21:07:25 pve1 ziti-edge-tunnel[227246]: (227246)[ 284.076] ERROR tunnel-cbs:ziti_tunnel_cbs.c:103 on_ziti_connect() ziti dial failed: connection is closed
This is strange. I previously tested quickstart 0.30.0 startup successfully after it dropped, but now I'm getting a ctrl plane certificate error for both quickstart 0.30.0 and 0.30.1. The problem seems to be the controller's ctrl plane TLS server is not presenting any certificates at all, typically 6262/tcp.
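A quick way to see whether a TLS listener presents a certificate is to open a connection and ask for the peer certificate without validating it. Below is a rough sketch in Python. The function name `peer_cert_der` is hypothetical, and the host and port in the usage comment are placeholders for wherever your ctrl plane listens. It only reports whether a leaf certificate came back; it does not check the chain.

```python
import socket
import ssl
from typing import Optional


def peer_cert_der(host: str, port: int, timeout: float = 5.0) -> Optional[bytes]:
    """Return the server's leaf certificate in DER form, or None if none was obtained."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Validation is deliberately disabled: the goal is only to see what the
    # server presents, not to decide whether to trust it.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return tls.getpeercert(binary_form=True)
    except OSError as err:  # ssl.SSLError is a subclass of OSError
        print(f"{host}:{port}: no certificate obtained: {err}")
        return None


# Usage (placeholder address and port):
#   der = peer_cert_der("ziti-controller", 6262)
#   print("no cert" if der is None else f"{len(der)} bytes of DER")
```

A handshake failure or a None result here would be consistent with the listener having no usable server certificate, though a listener that insists on a client certificate or a particular ALPN protocol could also refuse the connection.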
@TheLumberjack @berrabe FYI something's not right with the two latest quickstart releases.
I used the cert chains checker script to diagnose the problem with the ctrl plane listener's cert chain.
INFO: backing up /persistent/pki/cas.pem to /persistent/pki/cas.pem.20230826195706.bak
INFO: backing up /persistent/ziti-controller.yaml to /persistent/ziti-controller.yaml.20230826195706.bak
The following changes were made:
* Rebuilt the controller CA bundle
* Replaced the controller edge intermediate CA cert in the client API's web listener identity.ca with the controller edge root CA cert
Please restart the controller and all the main router and re-run this script without --rebuild.
Verify Ziti control and data planes are functioning:
$ go test ./quickstart/test/quickstart_test.go
ok command-line-arguments 10.621s
This does not repair quickstart >0.29.0 because, in newer versions, the controller's ctrl plane TLS server doesn't present any certs. I haven't been able to diagnose that one yet.
Hi @jptechnical, yes there was a change in 0.30.1 that changed environment variables. A new docker compose file was pushed too. It comes down to a problem with ZITI_ROUTER_ADVERTISED_ADDRESS and ZITI_ROUTER_ADVERTISED_HOST. If you add BOTH fields to the environment section of the docker-compose file, and BOTH fields to the .env file you can flop back and forth like you're trying to do:
# OpenZiti Variables
ZITI_IMAGE=openziti/quickstart
ZITI_VERSION=0.30.1
# the user and password to use
# Leave password blank to have a unique value generated or set the password explicitly
ZITI_USER=admin
ZITI_PWD=admin
# controller name, address/port information
ZITI_CTRL_NAME=ziti-controller
ZITI_CTRL_EDGE_ADVERTISED_ADDRESS=ctrl.home.pi
ZITI_CTRL_ADVERTISED_ADDRESS=ctrl.home.pi
ZITI_CTRL_ADVERTISED_HOST=ctrl.home.pi
#ZITI_CTRL_EDGE_IP_OVERRIDE=10.10.10.10
ZITI_CTRL_EDGE_ADVERTISED_PORT=8441
ZITI_CTRL_ADVERTISED_PORT=8440
# The duration of the enrollment period (in minutes), default if not set. shown - 7days
ZITI_EDGE_IDENTITY_ENROLLMENT_DURATION=10080
ZITI_ROUTER_ENROLLMENT_DURATION=10080
# router address/port information
#ZITI_ROUTER_NAME=ziti-edge-router
ZITI_ROUTER_ADVERTISED_ADDRESS=er.home.pi
ZITI_ROUTER_ADVERTISED_HOST=er.home.pi
ZITI_ROUTER_PORT=8442
#ZITI_ROUTER_IP_OVERRIDE=10.10.10.10
ZITI_ROUTER_LISTENER_BIND_PORT=8444
#ZITI_ROUTER_ROLES=public
With that compose file, I was able to successfully use zssh to zssh back to my own machine... So I'm pretty sure that's what happened here.
I don't expect any more changes like this for a long time... I think we're through the bumpy change period.
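If it helps anyone flipping between versions, here is a small sketch in Python that checks a .env file for the two router variables mentioned above. The function names are made up for illustration. Treating differing values as a problem is my assumption, not something the quickstart enforces.

```python
ROUTER_VARS = ("ZITI_ROUTER_ADVERTISED_ADDRESS", "ZITI_ROUTER_ADVERTISED_HOST")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and lines commented out with '#'."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_router_vars(env: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means both variables are set and agree."""
    problems = [f"{name} is not set" for name in ROUTER_VARS if not env.get(name)]
    # Assumption: both names should carry the same address so either image version works.
    if not problems and env[ROUTER_VARS[0]] != env[ROUTER_VARS[1]]:
        problems.append("router advertised address and host differ")
    return problems


# Usage:
#   with open(".env") as fh:
#       print(check_router_vars(parse_env(fh.read())) or "looks fine")
```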
One last question related to this: once I rebuild my resources using the new clean install from the quickstart, is there any reason I can't use this instance in a small production workload?
To clarify, I'm looking to use this on my internal network and a couple of small point-to-point pairings out in the wild to be able to get some real time experience with it. But what I don't want to do is run into another spot where I need to rebuild it or do a major overhaul to the certificates or anything like that.
I'll be honest, this little weekend of issues forced me to evaluate some other options even though I really didn't want to. I know there is stability in the commercial product, but I'm really looking for a self-hosted solution, with the commercial product for cases where I need extra support that I can't provide on my own.
Certainly no reason. While the original goal of the quickstart was, and remains, an educational vehicle, the defaults are all sensible and in line with best practices. The PKI is arguably overly complex, but in practice plenty of people run the quickstart in their production setups. Also note that the quickstart's main benefit is in generating the PKI and initial configuration, and these are things that, in practice, change little. I don't see any problems at all.
Thanks for your candid feedback. I'll be honest, I don't blame you. Hopefully, through our responsiveness, we've earned your trust that even when a rocky patch is hit, the team is responsive and fixes problems quickly. I do honestly believe that there have been very few "rough patches" like this with our quickstarts in the past. Not to say there have been none, it does still happen, but generally, they are quickly resolved. We're also undergoing an effort to add even more testing around these sorts of things so that we learn from our past and don't repeat mistakes.
Thanks @TheLumberjack, I have been super-wowed by the responsiveness. The timing was just not great... I was ready to start dogfooding it and had an immediate need for a really simple p2p rdp connection and had it all set up and half-deployed.
In the end, the extra effort was worth it... because I did get to look at some side-by-side comparison with a couple of wireguard setups... and while they were vastly more simple... I started bumping into the walls way faster than I thought. Ziti is just a greatly more capable solution.
Anyway, thanks for the transparency and kindness with some ranting. You continue to deliver.
Thank you for the update. I will test this later today.
I like your great support. Thank you for that.
Also, I had a great learning curve over the last few days. It helped me gain more knowledge.
Maybe a few suggestions for more transparency:
Have a channel where critical changes are announced (Quick Start Version 0.30.1 needs tunnel Version 0.22.6, env names changed, ...). This will help with updating the system without crashing.
Use a different container for the PKI, or use an init container for the PKI, router and controller stuff. Best via ansible, so that I would be able to do the init from outside the environment.
Have a page which shows compatible versions (tunnel 0.22.6 will work with ctrl 0.30.1, ...).
Thanks for the feedback. We appreciate when people take the time to provide it!
Thanks. I tried to do that via Discourse itself with a "psa" but sounds like that missed the mark. I'd be interested in hearing what sort of ideas you have for this sort of communication.
That's something to consider. thx
I don't think we state it anywhere but we do strive for full backward compatibility of older clients with newer networks. Older SDKs/tunnelers shouldn't need to be touched when upgrading a network. I'll bring this comment to the team and see if we can codify that stance somewhere.
The PSA definitely got our attention. So that was a win. So there is a good opportunity to iterate on that.
I think where it fell short was that it gave a scenario, a script, and a possible fix. However, I believe we would do better with details on what the issue is, what you are looking for, and how to fix it. If you started with a TL;DR and a bird's-eye view of the problem and solution (least effort), then that would give enough info to decide if this affects me or not and how much I want to dig in.
As regards the script, you definitely have a specific style of scripting that I haven't seen much. It's not a criticism, I love seeing other people's code, I learned a lot already by looking up some of your switches to common commands. So, no criticism intended, but it wouldn't hurt to be more verbose, make commands more long-form or briefly introduce the intent.
I know all of this means more work... but in the end I think it will save you more time.
I looked into it. The best I could do is make a "category" named something like "Important News" or "Public Notice" or ... something along those lines. Clearly, "PSA" is too "English-centric" (perhaps too US-centric even), that's my bad entirely. Something less easily confused is needed.
I'll end up making a topic if needed. I have pinned topics in the past (thought I pinned this one?) but I like the idea of a category that people can opt into for important news/notifications/messages.
And add a link to the Discourse category to the Quickstart or Troubleshooting documentation (Start Cooking With Ziti | OpenZiti).
I believe that's the starting point for a lot of people, so they would know where to subscribe.