How durable is this fabric?

So… here’s the big question: how durable is OpenZiti for production use? I’m not being critical here; I love open source, and I love the quality of engagement in these forums. So please don’t take this the wrong way.

In my short time here, I have found at least one bug in the underlying functionality and at least a couple in the UI/docs (which were immediately acted on, kudos!). Is this common? Or am I just producing some chaos that helps identify flaws?

I worked on an at-scale k8s project where we called my activity 'JP-as-a-Service' and I was our chaos monkey. So this isn’t a new experience for me :wink:

On the other hand, I am looking at business applications. While I am not talking about critical infrastructure, I am talking about my mental health: I need to be able to depend on things running, with breaking changes kept to a minimum. I know there is a commercial product as well, but evaluating commercial products brings a lot more players into the decision.

That said, what kind of durability have people seen when putting OpenZITI to long-term production use? I looked at the git log for the Adopters page and there isn’t any activity older than a year.


Hi @jptechnical,
I’m going to try to give you as much nuance as I can.

CloudZiti is built on OpenZiti and delivers hundreds of millions of sessions per year, including in many Fortune 50 environments. When I say it’s built on OpenZiti, I mean CloudZiti runs OpenZiti releases, with no patches or extra features added. The value-add is in infrastructure management and monitoring, as well as tools to aggregate and visualize the rich data that OpenZiti provides. I’m not trying to sell you CloudZiti here, just to make it clear that the baseline software you’d be running is the same, while answering the implicit question of why anyone would pay for CloudZiti :slight_smile:

The features which get used are generally very stable. Some features, like addressable terminators, haven’t seen much use yet and exist in an MVP state. We built addressable terminators for things like VoIP, P2P, and reducing service bloat for things like SSH; as the feature gets more use, it will evolve based on how it’s used. When something does start to get used, we do our best to jump in and fix things as quickly as possible (as you saw :slight_smile: )

From the perspective of being production ready, one thing I’m quite proud of is the amount of insight you can get into an OpenZiti environment. We’ve got lots of metrics, very robust utilization data, and events for all kinds of things, including entity-change events for auditing.

In terms of product stability, we’re working very hard to maintain backward compatibility for the edge protocol, since breaking it would cause users significant pain. There are also enough users that we attempt to minimize breakage in controller-to-router and router-to-router communications. For example, the link management changes going into the next release offer both forward and backward compatibility between the controller and routers. However, until we reach 1.0, we’re still allowing some breaking changes to happen. We are holding off on 1.0 until HA is released, which should happen relatively soon.

Let me know if that makes sense and if I can clarify anything.
Cheers,
Paul


Also, I appreciate the chaos you’re bringing :smiley: Everyone uses and approaches things differently and that helps refine everything in the project.

Cheers,
Paul


I’m not sure which sessions plorenz is referring to, but in terms of service dials, CloudZiti delivered over 119M connections last week and carried over 80TB in the last month from our top 20 networks. You have a lot of options for making production systems robust: additional fabric routers, HA ingress and egress, and soon, HA controllers. From experience, we know cloud regions/AZs have issues that never show up on their status pages, so redundancy is important.

You also need to follow normal operational monitoring practices on both the system and the underlying servers; no software works when you peg the CPU or run out of memory, of course. At scale, you’ll also need a monitoring system for the network’s operations so you can troubleshoot and react. Prometheus is supported directly, and there are other options as well.
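To make the monitoring point concrete, here is a minimal sketch of reading samples from a Prometheus text-format endpoint, the kind of exposition format a metrics-enabled component would serve. The metric names in the example payload are made up for illustration; check the OpenZiti documentation for the actual metric names and endpoint configuration in your deployment.

```python
# Minimal parser for the Prometheus text exposition format.
# Note: this sketch ignores optional trailing timestamps and histogram
# subtleties; it's for spot-checking a scrape, not a full client.

def parse_prometheus_text(payload: str) -> dict[str, float]:
    """Map each 'metric{labels} value' sample line to its float value."""
    samples = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE/comment lines
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore malformed lines rather than fail hard
    return samples

# Hypothetical payload; real metric names will differ per deployment.
example = """\
# HELP link_latency Latency between routers
# TYPE link_latency gauge
link_latency{link="l1"} 0.012
up 1
"""
metrics = parse_prometheus_text(example)
print(metrics["up"])  # -> 1.0
```

In practice you would point an actual Prometheus server at the endpoint and alert on the metrics it scrapes; a parser like this is only useful for quick manual checks while wiring that up.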

I’ll also echo the comment on the chaos. Bring it on! You’ve seen the reaction time and the concern; we are 100% focused on making OpenZiti even stronger than it is today, and we appreciate anyone who wants to help us on that journey.


@gormami I wasn’t sure what was OK to post about customers and usage, so I’m glad you chimed in with an appropriate level of detail :+1: