Hello! Here are some notes from my work with OpenZiti over the past few days. They include a mix of observations, pain points, ideas, wish-list items, and questions. I'll keep each point as short as possible so this doesn't get too bloated.
Update -- this post has gotten much longer than intended and is now truly bloated! It's also early Friday evening. I had tried to get time earlier this week to write this up, but it only happened now, so apologies in advance for dropping this large post going into a weekend! I'll be AFK for the weekend, but look forward to catching up early next week.
Versioning:
- ziti-edge-tunnel is referred to as ZET below. The latest versions of ZET and ziti are being used: 0.20.6 and 0.26.10, respectively.
Pain points or sources of confusion:
- On systems without systemd-resolved, where either resolvconf or a plain /etc/resolv.conf is in use, ZET injects its DNS IP address as `nameserver $ZET_DNS_IP` into /etc/resolv.conf, either directly with sed or via resolvconf. However, when the ZET service is stopped, there is no shutdown/cleanup hook to remove the now-dead ZET nameserver, which degrades system performance: the long default failover time to the secondary nameserver exceeds several other pieces of software's default lookup timeouts. Minor thing, and we're working around it with a systemd post-stop service hook.
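For reference, the workaround is roughly the following drop-in (a sketch only; the unit name and the ZET DNS address are placeholders for whatever applies on a given host):

```ini
# /etc/systemd/system/ziti-edge-tunnel.service.d/dns-cleanup.conf
[Service]
# Remove the ZET nameserver line after the service stops so lookups
# don't stall waiting on a dead resolver before failing over.
ExecStopPost=/usr/bin/sed -i '/^nameserver 100.64.0.2$/d' /etc/resolv.conf
```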
- If a simple service is set up and working with ZET on both the hosting and intercepting ends, using IP and port interception for both host.v1 and intercept.v1 config types, and the port is then changed in both host.v1 and intercept.v1 (e.g. changing the intercepted and hosted port from 2222 to 2223), the change does not get propagated automatically. It seems the dialing ZET needs to be restarted for the change to take effect. Then, on the first dial of the service to the hosting ZET, the connection fails, after which the hosting side appears to update itself with the proper config and a second dial attempt succeeds. I believe this behavior is similar for IP changes as well, and possibly hostnames too, but I haven't checked that. See the sketch below for the kind of change I mean.
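For concreteness, the change in question is just editing the port in both configs, e.g. via the CLI (config names and addresses are made up, and I'm going from memory on the `--data` flag, so treat this as a sketch):

```sh
# intercept.v1: move the intercepted port range from 2222 to 2223
ziti edge update config my-svc-intercept \
  --data '{"protocols":["tcp"],"addresses":["10.0.0.10"],"portRanges":[{"low":2223,"high":2223}]}'

# host.v1: move the hosted port to match
ziti edge update config my-svc-host \
  --data '{"protocol":"tcp","address":"127.0.0.1","port":2223}'
```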
- Similar to the above, I see I made a note that renaming an identity could cause this as well. I also noted that changing a service policy from "AllOf" semantics (which had left the service not working, since the AllOf requirement was not fulfilled) to "AnyOf" semantics (which should have made the service start working, since the AnyOf requirement was fulfilled) also created some issues, with me needing to poke around restarting things until the service behaved as expected. A sketch of that semantic change follows.
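For reference, the semantic flip was along these lines (policy name is made up, and I'm writing the flag from memory, so this is a sketch rather than exact syntax):

```sh
# relax the policy so any matching role satisfies it rather than all of them
ziti edge update service-policy my-dial-policy --semantic AnyOf
```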
- There is some config schema drift between the master commit of ziti-console and the actual schema of the latest ziti and ZET binaries. For example, the host.v1 option `"listenOptions": {"identity": "<PARAM>"}` isn't shown in the config UI, and if you select the "JSON" UI element to edit the JSON manually, you can add the parameter, but upon saving, it will not be there. CLI config creation is required in cases like this. Not a big deal, but confusing until one realizes what is happening.
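The CLI route that works for me looks roughly like this (name, address, and port are placeholders):

```sh
# create the host.v1 config directly so listenOptions survives,
# bypassing the console's older schema
ziti edge create config my-svc-host host.v1 \
  '{"protocol":"tcp","address":"127.0.0.1","port":2222,"listenOptions":{"identity":"<PARAM>"}}'
```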
DNS related:
- The `--dns-upstream` option of the ZET `run` command works, but is not shown in the CLI help output. I happened to come across mention of it in the tunneler client docs.
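For anyone else looking for it, usage is simply passing the upstream resolver address on the run command (the identity path and upstream IP here are placeholders):

```sh
ziti-edge-tunnel run --identity /opt/ziti/id.json --dns-upstream 10.0.0.53
```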
- The tproxy option for a ziti-router tests for primary-resolver access to DNS and fails if it doesn't have it. Why does it need this? The client tunneler documentation describes `ziti-tunnel` as having a similar requirement to be the primary resolver, but also mentions that: "... The nameserver will only answer queries for which it is authoritative i.e. OpenZiti Services' domain names, and so you will also need a secondary, recursive resolver." In the case of ziti-tunnel, which I haven't tried, I assume the secondary recursive resolver can be set with the `--resolver` flag. However, I don't see any similar option in the ziti-router command line or config options when running in tproxy mode. Did I miss how to set that config somewhere for ziti-router?
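For context, the router config I'm referring to is the tunnel listener binding, which (as far as I can tell) only exposes the mode, with no resolver-related option alongside it:

```yaml
# excerpt from my edge router config
listeners:
  - binding: tunnel
    options:
      mode: tproxy
```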
- @scareything I had gotten off topic from the DNS thread I started previously, so I'll go into that a bit more here and why I think it would be good to have support for disabling auto-integration of the ZET DNS resolver. And when I say disable, I mean still having it served from a predetermined binding address and port, but just not injecting it into resolved, resolvconf, or /etc/resolv.conf automatically.
- First, let me say I do realize DNS handling is a tricky issue; Tailscale has written quite a few articles on the complexity around this and the support headaches that can ensue. Here's an example you've probably already read.
- Our environments have a number of different DNS setups. Commonly, we're doing non-trivial DNS splitting and using some other resolver features, which we typically leverage dnsmasq for; here's an example. In some other cases we're using kresd, in others resolvconf. We are not using systemd-resolved on most systems at this time. There has been quite a bit of testing and tweaking of the various DNS components of the system and stack to get everything working nicely together.
- ZET inserts itself as the primary resolver and passes requests outside of its record set upstream. We could run ZET and point its upstream at our dnsmasq binding for non-ziti DNS queries, for example. But we haven't tested ZET enough to know how well it does under load, and if the ZET DNS service breaks or stops, then so does everything behind it, because dnsmasq is no longer primary; things may still work, but with degraded performance due to the slow failover to secondaries under direct resolv.conf or resolvconf control. I would much rather keep our known and tested config as primary, and split a `.ziti` suffix (or whatever else we need) out to ZET as the upstream for that suffix. That way, if ZET fails, only ZET is affected and the blast radius is reduced. One example that got me thinking more about this was a simple `dig +trace $FQDN` query which had ZET issue a DNS warning: `WARN tunnel-cbs:ziti_dns.c:759 on_upstream_packet() unexpected DNS response: too large`. And so I'm wondering what other odd DNS things or load patterns our systems might generate that could cause problems with ZET as primary feeding our usual config as secondary, rather than the other way around. Maybe there won't be problems this way, but I'd be much more confident about not creating new ones if we can just keep our usual config as primary. A sketch of the split I have in mind is below.
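What I'd like to be able to do instead is roughly this, with dnsmasq staying primary (the ZET resolver address here is a placeholder for wherever ZET binds its DNS):

```conf
# /etc/dnsmasq.d/ziti.conf
# forward only the ziti suffix to the ZET resolver;
# everything else follows our existing upstream configuration
server=/ziti/100.64.0.2
```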
- I've done a fair amount of reading on systemd-resolved in the past week. It has a lot of nice features, including split DNS, and is probably something for us to start working with more, but one downside it appears to have for our use case at the moment is that other applications can modify DNS behavior through the shared D-Bus API. For us, we really don't want other apps potentially changing DNS behavior at all, so dnsmasq with a locked-down and declared config still seems preferable. Even if we do change over to resolved on the majority of systems, that will take time and testing to make sure the new service works well with our existing stack, so in the meantime I'd still prefer to have the option of configuring ZET as a split-DNS upstream used by dnsmasq, with dnsmasq as the primary resolver.
- Our job scheduling framework sets up containers with custom DNS resolv.conf files so that name resolution across different network bridges and namespaces continues to work. If we add Ziti to a containerized jobset and it injects itself into those config files as primary resolver, it seems reasonable to think things might break. I haven't looked into the details or tested this enough to know for sure yet, so maybe I should trust ZET more here that everything will just work out ok :). IIRC, I saw a GH issue recently which addressed this very point. But my point is that we have a lot of complexity in a lot of different places, and if ZET lets us determine when and how we use its DNS, then I'm much more confident we can work through any problems that come up as we integrate it into our stack in various ways.
Questions:
- When the `--verbose N` CLI option is turned up with ZET to show trace logging, the dial side shows ongoing trace logging every few seconds, while the host side doesn't really show anything different at all; only the very first line of output on startup is a trace line, and the rest is the same with no ongoing logging. Similarly, the hosting ZET does not show any logging above INFO when connections are made. I'm guessing this is expected, but I also see `INFO ziti_log_set_level set log level: root=3` in the output regardless of what the verbose level is set to, so I wanted to confirm this is intended behavior and not a logging bug for hosted services on ZET.
- When doing an `scp` file transfer via a ZET-to-ZET IP intercept.v1/host.v1 service, the hosting ZET logged about 100 lines, each within about 3 ms of each other, all with the same message. This is repeatable with each scp file transfer. Is this expected? The file sha256 hashes are identical, so there is no transfer corruption, but I'm wondering if the logging is indicative of a functional problem we might hit when load is a bit higher. Transfer bandwidth in this case is about 90 Mbps. The log line that repeats many times is: `WARN ziti-sdk:channel.c:815 channel_alloc_cb() can't alloc message`
- Is the backup/restore procedure for all components (including the non-HA controller, which currently uses BoltDB storage) simply to back up the state directory and restore it prior to service restart? That's how it looks to me, but I haven't tested it yet. Something like the sketch below is what I have in mind.
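Roughly, the procedure I'd expect (paths and unit names are placeholders for our actual layout, and this is untested):

```sh
# stop the controller, snapshot its state directory, then restart
systemctl stop ziti-controller
tar -czf /backups/ziti-controller-$(date +%F).tar.gz /var/lib/ziti-controller
systemctl start ziti-controller

# restore is the reverse: stop, unpack the archive over the state dir, start
```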
- Are there any out-of-band JWT distribution ideas that people find particularly handy? We already use HashiCorp Vault, and a Ziti plugin that could issue appropriate JWT identities to systems that are already authenticated sounds like it might be a nice way to help with this.
- Is there a garbage collection mechanism for identities? One use case we may have in the near future is ZET as a sidecar for various CI jobs. These jobs run in significant numbers, and the assumption here is that we would already have secure enrollment automation taken care of... but on the flip side, we would then have an accumulation of identities to deal with. Perhaps in the case of short-lived jobs, an identity issued with a time-to-live (TTL) would take care of garbage collection. Alternatively, perhaps there could be a set of identities which are re-used; there the challenge may be secure allocation of identities to jobs with no concurrent identity usage. Absent either, I imagine we'd end up with a cleanup pass like the sketch below.
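A minimal sketch of that cleanup, assuming we track expired CI identities ourselves (the tracking file and naming convention are made up; only the delete command is standard CLI):

```sh
# remove identities recorded as expired by our CI tooling
while read -r id_name; do
  ziti edge delete identity "$id_name"
done < /var/run/ci/expired-ziti-identities.txt
```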
- As one example, the createPki helper in ziti-cli-functions.sh creates certs with 10-year expiration dates by default. Are there any docs or good references for Ziti cert rotation?
Other thoughts and ideas:
- ZET logs incoming connections for hosted services with the name of the hosted service and the client identity name. Could the name of the Ziti network also be added to the log line? Because a ZET host can be a shared resource hosting services across several Ziti networks, this would help identify the source of traffic, especially as these networks start to scale up.
- I like the `--identity-dir` option for ZET. I have a service prestart script that auto-enrolls and then cleans up any jwt files found in the identity dir during startup, which is handy (sketch below). I checked SIGHUP handling to see whether a SIGHUP would have ZET auto-reload identities in the identity dir when new ones become available, so that existing services which may be under load don't need to be broken by completely stopping and restarting ZET just to add new identities. Or instead, maybe an identity-refresh polling interval could be used for this, similar to the `--refresh N` option for service polling.
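For reference, the prestart enrollment script is roughly this (the identity dir path is a placeholder; enrolling with `ziti-edge-tunnel enroll --jwt ... --identity ...` is the path I'm using, as far as I know the documented one):

```sh
#!/bin/sh
# enroll any pending JWTs into the identity dir, then remove the JWT files
ID_DIR=/opt/ziti/identities
for jwt in "$ID_DIR"/*.jwt; do
  [ -e "$jwt" ] || continue
  ziti-edge-tunnel enroll --jwt "$jwt" --identity "${jwt%.jwt}.json" && rm -f "$jwt"
done
```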