The need for backup is different for identities, routers, and controllers. For all three, taking a backup does not inherently make the component unavailable; for example, a database snapshot is not disruptive.
Controller: backup is described in the backup guide and covers, at a minimum, the database and the root certificate authority.
Router: it's not strictly necessary to back up a router because the admin can re-enroll the same router ID if the host is lost or compromised.
Identity: like routers, identities can be re-enrolled by the admin if lost or compromised, so it's not strictly necessary to back up the identity file(s).
I admit it might make sense in some cases to back up a router's or identity's files, e.g., when requesting a new enrollment from the admin is inconvenient.
I mean, will the routers and identities still be enrolled and function normally, routing to the services they were assigned, or do we have to redo the steps of adding routers after restoring the controller from a backup?
I suggest that you check out the controller backup guide sections I mentioned about database snapshot and root CA (PKI).
Creating a backup (snapshot) of the controller's database makes a copy of the data on the disk that you can then move to another device for safekeeping. That way, if you lose the database, you have the option to restore the data.
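To make the "move it to another device" step concrete, here is a minimal sketch that copies a finished snapshot file to a backup location and verifies the copy with a checksum. The function names and the idea of a mounted backup directory are my assumptions for illustration; nothing here is part of Ziti itself.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_snapshot(snapshot: Path, dest_dir: Path) -> Path:
    """Copy a finished snapshot file into dest_dir and verify the copy.

    `snapshot` is assumed to be a completed snapshot file, not the live
    database file that the controller is still writing to.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / snapshot.name
    shutil.copy2(snapshot, target)
    if sha256_of(snapshot) != sha256_of(target):
        target.unlink()  # do not keep a copy that failed verification
        raise IOError(f"checksum mismatch copying {snapshot}")
    return target
```

You would point `dest_dir` at wherever the off-host storage is mounted; the checksum comparison only proves the copy matches the snapshot, not that the snapshot itself is restorable.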
Yes, understood. Just confirming: the controller's data covers its connections and the registrations made with the different routers, so if we back it up and run it on a different machine with a different IP, the existing routers will connect to the new controller and the controller will know how to route traffic to them?
The controller's database has the enrollments, yes, as well as all entities' state, including policies.
If you configured the controller address as a DNS name, you can restore to a new IP address by updating the DNS record to point to the new IP address.
Changing the advertised address in the controller configuration will cause existing enrollments to stop working, so it is essential to configure a DNS name, not an IP address, as the controller address.
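If you want to sanity-check this before you need a restore, a rough sketch like the following can flag an advertised address that is an IP literal. It assumes you have already pulled the address out of your controller configuration as a `host:port` string; the function names are hypothetical and it does not read any Ziti config format.

```python
import ipaddress


def is_ip_literal(host: str) -> bool:
    """True if host is an IPv4/IPv6 literal rather than a DNS name."""
    try:
        ipaddress.ip_address(host.strip("[]"))  # tolerate [v6] brackets
        return True
    except ValueError:
        return False


def survives_ip_change(address: str) -> bool:
    """True if the address uses a DNS name, so the controller can move
    to a new IP by updating the DNS record instead of the address."""
    if is_ip_literal(address):  # bare IP with no port
        return False
    host, sep, _port = address.rpartition(":")
    if not sep:  # bare hostname with no port
        host = address
    return not is_ip_literal(host)
```

For example, `survives_ip_change("ctrl.example.com:1280")` is `True`, while `survives_ip_change("203.0.113.10:1280")` is `False`.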
I don't know enough about Bolt DB to say whether it's immune to corruption. Backup systems like Velero can't guarantee database integrity, so corruption is possible if the storage is copied during a database operation.
The solution is to use the snapshot command from the backup guide and back up the snapshot files with Velero. If you restore from a snapshot then it will not be corrupted.
I was thinking of triggering a snapshot of the bbolt storage via the Ziti controller's API with some sort of hook inside Velero, and then taking the actual snapshot of the PVC. This way, data integrity should not be a concern. But this will also lead to a growing PVC, as there will be more and more snapshots over time...
I am not aware of any retention policies that apply directly to EBS-based PVCs. Could you clarify what you mean by retention policies in this context? Are there retention policies that can be configured on the Ziti controller to handle DB snapshot rotation automatically?
(From my point of view, the best case scenario would be if the ziti-controller uploads DB snapshots to an S3-compatible storage on its own.)
Currently, my solution for providing a DB backup in Kubernetes-deployed Ziti networks involves using EFS (NFS) backed storage for the Ziti controller's DB file. This way, the snapshots stored in the same location can be accessed by a Kubernetes job for cleanup and rotation. Additionally, the EFS auto-backup feature allows for restoring older DB snapshot versions.
It would be interesting to know if you have any experience running a Ziti controller that stores its DB file within an NFS/EFS volume.
Alternatively, it would be very helpful if the Ziti controller allowed for configuring a dedicated path for DB snapshots. This would help avoid potential performance issues associated with running the DB on NFS-based storage.
By "retention policies" I meant to suggest a possible solution to the continually growing size of the persistent storage where the controller's database and its snapshots are kept.
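As a rough sketch of what such a retention policy could look like when run from a cleanup job, assuming all snapshots land in one directory and can be matched by a filename pattern (the pattern, directory layout, and function name are my assumptions, not something the controller provides):

```python
from pathlib import Path
from typing import List


def prune_snapshots(directory: Path, pattern: str, keep: int) -> List[Path]:
    """Delete all but the `keep` newest files matching `pattern`.

    Returns the list of deleted paths. "Newest" is judged by file
    modification time, so the pattern must not match the live database.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    snapshots = sorted(
        (p for p in directory.glob(pattern) if p.is_file()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = snapshots[keep:]
    for path in stale:
        path.unlink()
    return stale
```

Running this on a schedule against the snapshot directory keeps the volume from growing without bound; choose the pattern carefully so it only ever matches snapshot files.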
I see the value of simplicity in EFS for database persistence, and I hadn't tried that myself. I peeked at EFS latencies, and what AWS has accomplished is impressive. Sub-millisecond reads!? It makes me think this might work. It's much better than the way I remember NFSv4.
Those are both excellent ideas. There's a way to set the output path for a snapshot when triggering snapshot creation through the agent (IPC) instead of the edge (API):
Thanks for your detailed explanation! I should also mention that recent product announcements include NetFoundry On-Prem, which provides installers, a K8S platform to run OpenZiti (available in many different flavours), and the newly developed 'NetFoundry Support Stack', which includes extensive automation (incl. Helm for upgrades), telemetry, events, and a pre-configured ElasticStack, Grafana, and dashboards for deep visibility, troubleshooting, and RCA. NF On-Prem also includes lifecycle management, technical adoption and GTM support, compliance and liability, and production SLAs. If you don't want NF On-Prem, we are also looking at providing NetFoundry Support Stack as a licensable component.
Happy to chat more about how NetFoundry licensable products make these topics super easy for users.