Misc notes after a few days of working with ziti

Hello! Here are some notes from my work with OpenZiti over the past few days. These include a mix of observations, pain points, ideas, wish-list items, and questions. I'll keep each point as short as possible so as not to bloat this too much.

Update -- this post has gotten much longer than intended and is now truly bloated! It's also early Friday evening. I had tried to get time earlier this week to write this up, but it only happened now, so apologies in advance for dropping this large post going into a weekend! I'll be AFK for the weekend, but look forward to catching up early next week.

Versioning:

  • ziti-edge-tunnel is referenced as ZET below. The latest versions of ZET and ziti are being used: 0.20.6 and 0.26.10, respectively.

Pain points or sources of confusion:

  • On systems without systemd-resolved that use either resolvconf or a plain /etc/resolv.conf, ZET injects its DNS IP address as nameserver $ZET_DNS_IP into /etc/resolv.conf, either directly with sed or via resolvconf. However, when the ZET service is stopped, there is no shutdown/cleanup hook to remove the now-defunct ZET nameserver. This degrades system performance because the default failover time to the secondary nameserver exceeds several other default software timeouts for lookups. A minor thing, and we're working around it with a systemd post-stop service hook (sketch below).
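
    A minimal sketch of that workaround, assuming the plain /etc/resolv.conf case and a packaged ziti-edge-tunnel.service unit name (both assumptions; adjust to your setup):

      # Hypothetical drop-in: /etc/systemd/system/ziti-edge-tunnel.service.d/dns-cleanup.conf
      [Service]
      # Remove the nameserver line ZET injected once the service stops.
      # ZET_DNS_IP is a placeholder for the address ZET actually added; sed -i
      # is only appropriate when /etc/resolv.conf is a plain file, not a symlink.
      ExecStopPost=/usr/bin/sed -i '/^nameserver ZET_DNS_IP$/d' /etc/resolv.conf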

  • If a simple service is set up and working with ZET on both the hosting and intercepting ends, using IP and port interception for both the host.v1 and intercept.v1 config types, then changing the port in both configs (for example, from 2222 to 2223) does not propagate automatically. It seems the dialing ZET needs to be restarted for the change to take effect. Then, on the first dial to the hosting ZET, the connection fails, after which the hosting side appears to update itself with the proper config and a second dial attempt succeeds. I believe this behavior is similar for IP changes as well, and possibly host names too, but I haven't checked that.

    • Similar to the above, I made a note that renaming an identity could cause this as well, although again I haven't re-verified it. I also noted that changing a service from "AllOf" semantics (which left the service non-working since the AllOf requirement was not fulfilled) to "AnyOf" semantics (which should have made the service start working since the AnyOf requirement was fulfilled) also created some issues, with me needing to poke around restarting things until the service behaved as expected.

  • There is some config schema drift between the master commit of ziti-console and the actual schema of the latest ziti and ZET binaries. For example, the host.v1 option "listenOptions": {"identity": "<PARAM>"} isn't shown in the config UI, and if you select the "JSON" UI element to edit the JSON manually, you can add the parameter, but upon saving it will not be there. CLI config creation is required in cases like this (a sketch follows below). Not a big deal, but confusing until one realizes what is happening.
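
    As a concrete example of the CLI route, something along these lines creates a host.v1 config carrying the listenOptions identity parameter that the console currently drops (the config name and JSON values here are placeholders):

      ziti edge create config ssh-host-config host.v1 \
        '{"protocol":"tcp","address":"127.0.0.1","port":2223,"listenOptions":{"identity":"<PARAM>"}}'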

DNS related:

  • The --dns-upstream option of the ZET run command works, but is not shown in the CLI help output. I happened to come across a mention of it in the tunneler client docs.

  • The tproxy option for a ziti-router server tests for primary resolver access to DNS and fails if it doesn't have it. Why does it need this? The client tunneler documentation here describes ziti-tunnel as having a similar requirement to be the primary resolver, but also mentions that:

    "... The nameserver will only answer queries for which it is authoritative i.e. OpenZiti Services' domain names, and so you will also need a secondary, recursive resolver."

    In the case of ziti-tunnel, which I haven't tried, I assume the secondary recursive resolver can be set with the --resolver flag. However, I don't see any similar option in the ziti-router command line or config options when running in tproxy mode. Did I miss seeing how to set that config somewhere for ziti-router?

  • @scareything I had gotten off topic from the DNS thread I started previously, so I'll go into that a bit more here, and explain why I think it would be good to have support for disabling auto-integration of the ZET DNS resolver. When I say disable, I do mean to still have it served from a predetermined bind address and port, just not injected into resolved, resolvconf, or directly into /etc/resolv.conf automatically.

    • First, let me say I do realize DNS handling is a tricky issue; Tailscale has written quite a few articles on the complexity around this and the support headaches that can ensue. Here's an example you've probably already read.

    • Our environments have a number of different DNS setups. Commonly, we're doing non-trivial DNS splitting and using some other resolver features, which we typically leverage dnsmasq for; here's an example. In some other cases we're using kresd, and in others resolvconf. We are not using systemd-resolved on most systems at this time. There has been quite a bit of testing and tweaking of the various DNS components of the system and stack to get everything working nicely together.

    • ZET inserts itself as primary resolver and passes requests outside of its record set upstream. We could run ZET and send non-ziti DNS queries to the upstream of our dnsmasq binding, for example. But we haven't tested ZET enough to know how well it does under load, and if the ZET DNS service breaks or stops, then so does everything behind it, because dnsmasq is no longer primary; resolution may still work, but with degraded performance due to the slow failover to secondaries under direct resolv.conf or resolvconf control. I would much rather keep our known and tested config as primary, and split a .ziti suffix (or whatever else we need) out to ZET as the upstream for that suffix (see the dnsmasq sketch after this list). That way, if ZET fails, only ZET is affected and the blast radius is reduced. One example which got me thinking more about this was a simple dig +trace $FQDN query which had ZET issue a DNS warning:

      WARN tunnel-cbs:ziti_dns.c:759 on_upstream_packet() unexpected DNS response: too large
      

      And so I'm wondering what other odd DNS things or load patterns our systems might do that could cause problems with ZET as primary feeding our usual config as secondary, rather than the other way around. Maybe there won't be problems this way, but I'd be much more confident about not having new problems if we can just keep our usual config as primary.

    • I've done a fair amount of reading on systemd-resolved in the past week. It has a lot of nice features, including split DNS, and is probably something for us to start working with more, but one downside it appears to have for our use case at the moment is that other applications can modify DNS behavior through the shared API/D-Bus. We really don't want other apps potentially changing DNS behavior at all, so dnsmasq with a locked-down, declared config still seems preferable. Even if we do change over to resolved on the majority of systems, that will take time and testing to make sure the new service works well with our existing stack, so in the meantime I'd still prefer the option of configuring ZET as a split-DNS upstream used by dnsmasq, with dnsmasq as the primary resolver.

    • Our job scheduling framework sets up containers with custom DNS resolv.conf files so that name resolution across different network bridges and namespaces continues to work. If we add Ziti to a containerized jobset and it injects itself into config files as the primary resolver, it seems reasonable to think things might break. I haven't looked into the details or tested this to know for sure yet, so maybe I should trust ZET more here that everything will just work out ok :). IIRC, I saw a GH issue recently which addressed this very issue. But my point is that we have a lot of complexity in a lot of different places, and if ZET allows us to determine when and how we use ZET DNS, then I'm much more confident we can work through any problems that might come up as we integrate it into our stack in various ways.
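
To make the split-DNS preference above concrete, here is a minimal dnsmasq sketch of the arrangement we'd like to keep, with ZET's resolver left on a known bind address but not injected as primary. The .ziti suffix and the addresses are placeholders/assumptions:

    # dnsmasq remains the primary resolver declared in /etc/resolv.conf.
    # Forward only *.ziti queries to the ZET resolver on its known bind address;
    # 100.64.0.2 stands in for wherever ZET serves DNS.
    server=/ziti/100.64.0.2
    # Everything else goes to the usual upstreams.
    server=8.8.8.8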

Questions:

  • When the --verbose N CLI option is turned up with ZET to show trace logging, the dial side shows ongoing trace logging every few seconds, while the host side doesn't really show anything different at all; only the very first line of output on startup is a trace line, and the rest is the same with no ongoing logging. Similarly, the hosting ZET does not show any logging above INFO when connections are made. I'm guessing this is expected, but I also see INFO ziti_log_set_level set log level: root=3 in the output regardless of what the verbose level is set to, so I wanted to confirm this is intended behavior and not a logging bug for hosted services on ZET.

  • When doing an scp file transfer via a ZET-to-ZET IP intercept.v1 to host.v1 config type service, the hosting ZET logged about 100 lines, each within about 3 ms of each other, all with the same message. This is repeatable with each scp file transfer. Is this expected? The file sha256 hashes are identical, so there is no transfer corruption, but I'm wondering if the logging is indicative of a functional problem we might hit under somewhat higher load. Transfer bandwidth in this case is about 90 Mbps. The log line that repeats many times is:

    WARN ziti-sdk:channel.c:815 channel_alloc_cb() can't alloc message
    
  • Is the backup/restore procedure for all components (including the non-HA controller, which currently uses boltDB storage) simply to back up the state directory and restore it prior to service restart? That's how it looks to me, but I haven't tested it yet (see the sketch after this list).

  • Are there any out-of-band JWT distribution ideas that people find particularly handy? We already use HashiCorp Vault, and a Ziti plugin that could issue appropriate JWT identities to systems that are already authenticated sounds like it might be a nice thing to help with this.

  • Is there a garbage collection mechanism for identities? One use case we may have in the near future is ZET as a sidecar for various CI jobs. These jobs run in significant numbers, and the assumption here is that we would already have secure enrollment automation taken care of... But on the flip side, we would then have an accumulation of identities to deal with. Perhaps in the case of short-lived jobs, issuing an identity with a time-to-live (TTL) would take care of garbage collection. Alternatively, perhaps there could be a set of identities which are re-used. There the challenge may be secure allocation of identities to the jobs with no concurrent identity usage.

  • The createPki ziti-cli-function, as an example, creates certs with 10-year expiration dates by default. Are there any docs or good references for Ziti cert rotation?
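
On the backup/restore question above, the untested approach I have in mind is roughly the following; the paths and unit name are assumptions for illustration:

    # Stop the controller so the boltDB file is quiescent, archive the state
    # directory, then restart. Restore would be unpacking the archive back over
    # the state directory before starting the service again.
    systemctl stop ziti-controller
    tar -czf /backups/ziti-controller-state-$(date +%F).tar.gz /var/lib/ziti-controller
    systemctl start ziti-controller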

Other thoughts and ideas:

  • ZET logs incoming connections for hosted services with the name of the hosted service and the client identity name. Could the name of the Ziti network also be added to the log line? This would help identify the source of traffic when a shared ZET host is hosting services across several Ziti networks, especially as those networks start to scale up.

  • I like the identity-dir option for ZET. I have a service prestart script that auto-enrolls and then cleans up any JWT files found in the identity dir during startup, which is handy (sketch below). I checked SIGHUP handling to see whether a SIGHUP would have ZET reload identities in the identity dir when new ones become available, so that existing services which may be under load don't need to be broken by completely stopping and restarting ZET just to add new identities. Or instead, maybe an identity refresh polling interval could serve this purpose, similar to the --refresh N option for service polling.
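
    For reference, the prestart enrollment pass looks roughly like this sketch (the identity directory path is ours, and the ziti-edge-tunnel enroll invocation should be treated as an assumption to check against the docs):

      #!/usr/bin/env sh
      # Enroll any JWTs dropped into the identity dir, then remove the consumed JWT
      # so only the enrolled identity JSON remains.
      IDENTITY_DIR=/var/lib/ziti/identities
      for jwt in "$IDENTITY_DIR"/*.jwt; do
          [ -e "$jwt" ] || continue
          name=$(basename "$jwt" .jwt)
          ziti-edge-tunnel enroll --jwt "$jwt" --identity "$IDENTITY_DIR/$name.json" \
              && rm -f "$jwt"
      done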


John - this is some incredible feedback - really appreciated. Our team will clearly be digging into this data.

Yours in code
Mike Kochanik


Hey John,

Thanks again for your valuable feedback here. I'll defer to maintainers to address the majority of the issues, but I wanted to chime in on a few of your points.

So we also take the approach of using an ExecStartPre script in the system unit distributed through the distribution packages. These packages, by default, expect identities to be installed to a directory (which can be overridden by the user). JWT tokens installed to this directory are automatically enrolled. You can see the source for these: systemd unit and ExecStartPre. Since you're providing what one day may become the de-facto way of running the ZET on NixOS, it would be great if these aligned with yours (which they do seem roughly equivalent).

On reloading from SIGHUP (or maybe SIGUSR1), or refresh interval, I think this is a valid enhancement request to the project.

I would raise an issue, treating this as a defect which asserts the ZET should clean up after itself. Raising this issue could result in better handling in the application (where it might correctly belong), but could also similarly be done in packaging. This is another place where alignment of behavior would be great between the NixOS package and the rest of the packaging formats. I'd encourage you to either submit a PR to packaging or raise the issue so it can be discussed amongst maintainers to determine the best course of action.

So, this point has two problems, both having solutions in configuration management, which I believe are out of scope for the ZET source. First, D-Bus actually defines policy regarding what can communicate with a bus name and interface. You can also configure this through polkit. You can find more information for that here, and here, respectively. In this case, this would mean a request to the systemd maintainers to tighten access to the systemd-resolved interfaces by providing these policies. Second, the system administrator can supply these policies as well, and/or ensure consistent layering in the applications they deploy and manage on systems. We should only address the case where the ZET does not layer sufficiently well, but not take on the administrator's responsibility to ensure a fully consistent environment in this way.
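
For example, a polkit rules sketch along these lines could deny resolved DNS reconfiguration from unprivileged callers; the action-ID prefix is assumed from systemd-resolved's shipped policy, so verify it against your distribution:

    // Hypothetical rules file, e.g. /etc/polkit-1/rules.d/10-resolved-lockdown.rules
    polkit.addRule(function(action, subject) {
        // Refuse resolve1 actions (set DNS servers, domains, etc.) for non-root callers.
        if (action.id.indexOf("org.freedesktop.resolve1.") === 0 &&
            subject.user !== "root") {
            return polkit.Result.NO;
        }
    });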

The ZET really has two personalities when operating inside a container. The recent issue on GH (which is resolved now, my apologies!) related to the mixed-mode operation. If you mount the DBus socket from the host to the container, the ZET will configure the host through libsystemd or the available systemd-resolved binaries (these also work over the DBus socket). When these interfaces are not detected, or available through a container which boots systemd, a ZET running in a container can still configure the local interfaces. The only requirement here is that the ZET run last, I suppose, so that it can be the first resolver.
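
As a rough illustration of the first mode, mounting the host's DBus socket into the container is what lets the ZET configure the host's resolver; the image name, paths, and flags below are illustrative assumptions rather than a prescribed invocation:

    docker run -d --name ziti-edge-tunnel \
      --network host \
      --cap-add NET_ADMIN \
      --device /dev/net/tun:/dev/net/tun \
      -v /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket \
      -v /opt/ziti/identities:/ziti-edge-tunnel \
      openziti/ziti-edge-tunnel run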

Depending on your use case, ensuring the ZET runs last might be sufficient; however, another enhancement request for the ability to disable DNS auto-configuration is valid. I suppose the concern here is around then having to field issues where users unintentionally break Ziti DNS resolution, but the flexibility may be worth the cost.

Thanks again John for your thoroughness.

  • Steven

Hi Steven!

So we also take the approach of using an ExecStartPre script in the system unit distributed through the distribution packages. These packages, by default, expect identities to be installed to a directory (which can be overridden by the user). JWT tokens installed to this directory are automatically enrolled. You can see the source for these: systemd unit and ExecStartPre. Since you’re providing what one day may become the de-facto way of running the ZET on NixOS, it would be great if these aligned with yours (which they do seem roughly equivalent).

Ah, ok. I didn't notice those because I've only been using patched binaries so far. Sure, moving forward when we get to full NixOS Ziti packaging, we'll work to align with Ziti's official service definitions as much as possible.

On reloading from SIGHUP (or maybe SIGUSR1), or refresh interval, I think this is a valid enhancement request to the project.

Submitted: GH#535

I would raise an issue, treating this as a defect which asserts the ZET should clean up after itself. Raising this issue could result in better handling in the application (where it might correctly belong), but could also similarly be done in packaging. This is another place where alignment of behavior would be great between the NixOS package and the rest of the packaging formats. I’d encourage you to either submit a PR to packaging or raise the issue so it can be discussed amongst maintainers to determine the best course of action.

Submitted as an issue rather than a PR as I'll likely have difficulty finding time beyond what I'm already allocating to Ziti at the moment: GH#536. I noted preference for a solution in the app itself as the app (binary) won't always be run in the context of systemd/packaging.

Re: systemd-resolved:

So, this points has two problems, both having solutions in configuration management which I believe are out of scope for the ZET source. ... [snip] ...

Ah, ok, thanks for the info. I now have some further reading to do on resolved with your references, and if or when we decide to migrate to resolved as an alternative to how we are resolving now, it looks like we may have a way to accomplish the setup we'd like.

The ZET really has two personalities when operating inside a container. The recent issue on GH (which is resolved now, my apologies!) related to the mixed-mode operation. If you mount the DBus socket from the host to the container, the ZET will configure the host through libsystemd or the available systemd-resolved binaries (these also work over the DBus socket). When these interfaces are not detected, or available through a container which boots systemd, a ZET running in a container can still configure the local interfaces. The only requirement here is that the ZET run last, I suppose, so that it can be the first resolver.

I think this means that for environments with non-resolved containers (and hosts), which is mostly our case, DNS auto-injection into resolv.conf will still happen. Given the resolved polkit references linked above, we could probably migrate to resolved and get everything working as we'd like across our environments, but realistically, due to other pressing deadlines, we probably won't be doing that soon. That leaves us with DNS auto-injection and ziti always becoming the primary resolver, which, for the reasons in the initial post above, we'd really rather not have. I've raised an FR at GH#537, and will accept whatever the outcome is. Thank you for your feedback! :slight_smile:

Two other unrelated clarifications to the initial post above:

  • On the question of whether the verbosity option (--verbose N) on the host side may have a bug: while I was filing the GH tickets above, I noticed that @qrkourier filed an issue, about a day and a half before I posted, which may explain this observation: GH#531

  • On the pain-point/confusion note about changing service properties (port, IP, identity names, semantic options) of an already live and running service, and those changes not appearing to propagate to the ZET endpoints without some restarting of the endpoints: upon re-reading those two paragraphs this morning, the description sounds a bit confusing and I can no longer edit that post, so a couple of clarifications:

    • I reproduced that issue for the port change case on two different days, starting each time from a clean state, to be sure it wasn't just a transient issue. It might still be a problem only with my particular configuration for some reason, but it is repeatable.
    • Although I encountered and made note of the other change types having similar propagation issues (IP, identity names (I mentioned host names in the initial post, but meant identity names), and semantic options), I did not have time to reproduce those cases as well to make sure the issue is repeatable and not transient.