Helm Port Mappings

Sure thing!

The thing is, even though they are in the same cluster, I do plan on having multiple routers eventually, so I would prefer to use the public IP address instead of the .svc.cluster.local internal DNS name.

As for linkListeners.transport, I currently have it enabled (the default from the values.yaml).

At the moment I'm a bit confused, as MetalLB seems unhappy only about giving the edge router service an external IP address:

kubectl -n ziti get svc
NAME                     TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)          AGE
ziti-controller-ctrl     ClusterIP      10.43.115.112   <none>          443/TCP          36d
ziti-controller-client   LoadBalancer   10.43.249.23    192.168.1.206   8441:31808/TCP   36d
ziti-router-edge         LoadBalancer   10.43.173.20    <pending>       8443:30645/TCP   91s
ziti-router-transport    ClusterIP      10.43.240.174   <none>          443/TCP          91s

Looking at the metallb controller logs shows this:

{"caller":"service.go:135","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-10-20T20:41:15Z"}
{"caller":"service_controller.go:100","controller":"ServiceReconciler","endpoints":"{\"Type\":0,\"EpVal\":null,\"SlicesVal\":null}","event":"failed to handle service","level":"error","name":"ziti/ziti-router-edge","service":"{\"kind\":\"Service\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"ziti-router-edge\",\"namespace\":\"ziti\",\"uid\":\"f073f6bf-b707-4a74-a09b-fb3c0f21a593\",\"resourceVersion\":\"13100815\",\"creationTimestamp\":\"2023-10-20T20:41:15Z\",\"labels\":{\"app.kubernetes.io/instance\":\"ziti-router\",\"app.kubernetes.io/managed-by\":\"Helm\",\"app.kubernetes.io/name\":\"ziti-router\",\"app.kubernetes.io/version\":\"0.29.0\",\"helm.sh/chart\":\"ziti-router-0.6.0\"},\"annotations\":{\"meta.helm.sh/release-name\":\"ziti-router\",\"meta.helm.sh/release-namespace\":\"ziti\"},\"managedFields\":[{\"manager\":\"helm\",\"operation\":\"Update\",\"apiVersion\":\"v1\",\"time\":\"2023-10-20T20:41:15Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:meta.helm.sh/release-name\":{},\"f:meta.helm.sh/release-namespace\":{}},\"f:labels\":{\".\":{},\"f:app.kubernetes.io/instance\":{},\"f:app.kubernetes.io/managed-by\":{},\"f:app.kubernetes.io/name\":{},\"f:app.kubernetes.io/version\":{},\"f:helm.sh/chart\":{}}},\"f:spec\":{\"f:allocateLoadBalancerNodePorts\":{},\"f:externalTrafficPolicy\":{},\"f:internalTrafficPolicy\":{},\"f:ports\":{\".\":{},\"k:{\\\"port\\\":8443,\\\"protocol\\\":\\\"TCP\\\"}\":{\".\":{},\"f:name\":{},\"f:port\":{},\"f:protocol\":{},\"f:targetPort\":{}}},\"f:selector\":{},\"f:sessionAffinity\":{},\"f:type\":{}}}}]},\"spec\":{\"ports\":[{\"name\":\"edge\",\"protocol\":\"TCP\",\"port\":8443,\"targetPort\":3022,\"nodePort\":30463}],\"selector\":{\"app.kubernetes.io/component\":\"ziti-router\",\"app.kubernetes.io/instance\":\"ziti-router\",\"app.kubernetes.io/name\":\"ziti-router\"},\"clusterIP\":\"10.43.204.145\",\"clusterIPs\":[\"10.43.204.145\"],\"type\":\"LoadBalancer\",\"sessionAffinity\":\"None\",\"externalTrafficPolicy\":\"Cluster\",\"ipFamilies\":[\"IPv4\"],\"ipFamilyPolicy\":\"SingleStack\",\"allocateLoadBalancerNodePorts\":true,\"internalTrafficPolicy\":\"Cluster\"},\"status\":{\"loadBalancer\":{}}}","ts":"2023-10-20T20:41:15Z"}

I have tried pointing the router at a few different IPAddressPools, but MetalLB still won't assign an address, and only the router is affected.

As you can see from the get svc output above, the controller picks up its external IP address just fine.

Also, just to debug, I set the edge section to ClusterIP, and the same "invalid header" message shows in the router pod logs.

Here are the relevant bits of the values.yaml for the router:

ctrl:
  endpoint: ziti-controller.domain.com:8441

advertisedHost: ziti-router.domain.com


linkListeners:
  transport:  # https://docs.openziti.io/docs/reference/configuration/router/#transport
    containerPort: 10080
    advertisedHost: #router11-transport.router-namespace.svc:443
    advertisedPort: 443
    service:
      enabled: true
      type: ClusterIP
      labels:
      annotations:
    ingress:
      enabled: false
      annotations:

edge:
  enabled: true
  containerPort: 3022
  advertisedHost: ziti-router.kincke.com #router11-edge.ziti.example.com
  # advertisedPort: 443
  advertisedPort: 8443
  service:
    # -- create a cluster service for the edge listener
    enabled: true
    type: LoadBalancer
    labels:
    annotations:
      metallb.universe.tf/address-pool: ziti-router
  ingress:
    enabled: false
    annotations:

Maybe you will spot something that I missed :slight_smile:

Not only must the router's link listener be exposed to other routers; the controller's ziti-controller-ctrl service must also be exposed to all routers. That service is the target of each router's ctrl.endpoint configuration value.

So you may configure all routers to connect to the ctrl.endpoint that the controller provides through MetalLB. Any routers that happen to be inside the same cluster as the controller could instead use the cluster.local service address as the value of ctrl.endpoint in their configs, but that's not required.
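
For example, the two forms could look like this in a router's values (the in-cluster address and port come from your service listing above; the external name and port are placeholders for whatever you end up publishing through MetalLB):

ctrl:
  # reachable from anywhere, via the MetalLB-assigned external IP:
  endpoint: ziti-controller.domain.com:8440
  # optional alternative for routers in the same cluster as the controller:
  # endpoint: ziti-controller-ctrl.ziti.svc.cluster.local:443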

In the list of K8s services, ziti-controller-ctrl doesn't have an external IP (it shows <none>), so I think it's not configured to be publicly exposed to routers. I'm sure that is the cause of "invalid header."

That all makes sense.

It seems like the best remedy would be to give the ctrlPlane service in the controller's values.yaml a LoadBalancer type.
Do I have that correct?

I believe that's correct. That's certainly the correct part of the controller chart's values.yaml. The value of ctrlPlane.service.type determines the value of type in the Kubernetes Service resource. It sounds like you have a service controller that watches for Service resources of type LoadBalancer and assigns them an external IP.

The other valid values for type are ClusterIP and NodePort. ClusterIP can be used in conjunction with an Ingress controller to publish K8s services, while NodePort exposes the K8s service directly on the specified TCP port on every worker node's IP address.
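
For example, a minimal sketch of that change in the controller's values.yaml — the service block layout here is assumed to mirror the router chart's edge.service block shown earlier, so verify the exact keys against your controller chart version:

ctrlPlane:
  service:
    enabled: true
    type: LoadBalancer
    annotations:
      metallb.universe.tf/address-pool: ziti-controller  # hypothetical pool name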

Great!

Yes, I am using MetalLB. It does exactly that :slight_smile:
I have assigned two IPAddressPools for the controller: one for the ctrl and one for the client:

kubectl -n ziti get svc
NAME                     TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)          AGE
ziti-controller-ctrl     LoadBalancer   10.43.115.112   192.168.1.206   8440:31048/TCP   36d
ziti-controller-client   LoadBalancer   10.43.249.23    192.168.1.207   8441:31055/TCP   36d

I changed the values.yaml for the router to be this:

ctrl:
  endpoint: ziti-controller.domain.com:8440

Now when I try to install the router, I get a certificate error:

[   0.020]   ERROR fabric/router/env.(*networkControllers).connectToControllerWithBackoff.func2: {endpoint=[tls:ziti-controller.domain.com:8440] error=[error connecting ctrl (tls: failed to verify certificate: x509: certificate is valid for localhost, ziti-controller, ziti-controller-ctrl, ziti-controller-ctrl.ziti, ziti-controller-ctrl.ziti.svc, ziti-controller-ctrl.ziti.svc.cluster.local, ziti-controller-ctrl.ziti.svc.cluster.local, not ziti-controller.domain.com)]} unable to connect controller

So you know, I do have cert-manager running with a ClusterIssuer that can answer DNS-01 challenges.

The controller expects routers to reach the ctrl plane service via the address it advertises, on the port set in ctrlPlane.advertisedPort. This facilitates mutual TLS.

The advertised address is set in the value ctrlPlane.advertisedHost. If you upgrade the controller Helm release with these values, the DNS SAN for the advertised address will be added to the server cert.
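
Something like this, assuming your release is named ziti-controller and your values live in values-controller.yaml (adjust both to match your setup):

helm -n ziti upgrade ziti-controller openziti/ziti-controller --values values-controller.yaml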

Okay, so if I modify the values.yaml for the controller like this:

ctrlPlane:
  dnsNames:
    - ziti-controller.domain.com

clientApi:
  dnsNames:
    - ziti-controller.domain.com

And the values.yaml for the router like this:

csr:
  dns:
    - ziti-router.domain.com

should that remedy everything?

The dnsNames lists don't have a Ziti-specific purpose. They exist to let you specify additional SANs, in case you have a reason for that.

The ctrlPlane.advertisedHost controller chart value determines the DNS SAN that routers will use to verify the controller's TLS server certificate: https://github.com/openziti/helm-charts/blob/main/charts/ziti-controller/templates/ca-router-ctrl-identity.yaml#L118

Similarly for the client API, the advertised host is a domain name that edge clients will resolve to find the client API, and determines the DNS SAN for that API's web listener.

No, it's unnecessary to set an additional DNS SAN for the router either. Only the router's advertisedHost value must be set. This is the domain name that edge clients and other routers will resolve to find their respective listeners. If the edge listener (consumed by identities) and link listener (consumed by routers) are to be reached via distinct domain names, then the advertised host for each may be set under the edge and linkListeners sections, respectively, as in the sketch below.
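
A sketch of that split, with hypothetical domain names (the keys come from the router values you pasted earlier):

edge:
  advertisedHost: ziti-router-edge.domain.com
  advertisedPort: 8443

linkListeners:
  transport:
    advertisedHost: ziti-router-link.domain.com
    advertisedPort: 443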

Given that I have this for the controller:

ctrlPlane:
  advertisedHost: ziti-controller.domain.com

why am I receiving this error?

[   0.020]   ERROR fabric/router/env.(*networkControllers).connectToControllerWithBackoff.func2: {endpoint=[tls:ziti-controller.domain.com:8440] error=[error connecting ctrl (tls: failed to verify certificate: x509: certificate is valid for localhost, ziti-controller, ziti-controller-ctrl, ziti-controller-ctrl.ziti, ziti-controller-ctrl.ziti.svc, ziti-controller-ctrl.ziti.svc.cluster.local, ziti-controller-ctrl.ziti.svc.cluster.local, not ziti-controller.domain.com)]} unable to connect controller

That reads to me like a DNS-01 challenge was not completed by the controller.

Do I need to do something special with this ctrlPlane.alternativeIssuer?

I see you have ctrlPlane.advertisedHost in your controller chart values set to ziti-controller.domain.com. Have you upgraded the installed controller release with the updated values?

When you do, I expect the template for the control plane PKI will be rendered with that domain name in the list of DNS SANs for the Certificate resource.

Finally, after changing one of the resources used by the controller, you must restart the controller process by deleting the running pod.

kubectl -n ziti delete pod --selector app.kubernetes.io/instance=ziti-controller

If you've already done both of these steps then let's investigate the current state of the Certificate resource that was supposed to be modified when you upgraded the Helm release with the new value.

Here's an example from my lab env.

kubectl get certificates ziti-controller-ctrl-plane-identity --output jsonpath='{.spec.dnsNames}' | jq
[
  "localhost",
  "ziti-controller",
  "ziti-controller-ctrl",
  "ziti-controller-ctrl.miniziti",
  "ziti-controller-ctrl.miniziti.svc",
  "ziti-controller-ctrl.miniziti.svc.cluster.local",
  "ziti-controller-ctrl.miniziti.svc.cluster.local"
]

Brilliant! That was the trick!
I had upgraded the helm release, but deleting the pod made the certificates happy.

I do feel like I have stumbled upon a bug with the router, however.

kubectl -n ziti get svc
NAME                     TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)          AGE
ziti-controller-ctrl     LoadBalancer   10.43.115.112   192.168.1.206   8440:31048/TCP   39d
ziti-controller-client   LoadBalancer   10.43.249.23    192.168.1.207   8441:31055/TCP   39d
ziti-router-transport    ClusterIP      10.43.203.160   <none>          443/TCP          21m
ziti-router-edge         LoadBalancer   10.43.30.106    <pending>       8443:30558/TCP   21m
ziti-console             ClusterIP      10.43.6.35      <none>          80/TCP           8m53s

The controller grabs a LoadBalancer IP address from MetalLB perfectly, but the router's edge service hangs in <pending>.
I tailed the MetalLB controller logs and found this:

{"caller":"service_controller.go:60","controller":"ServiceReconciler","level":"info","start reconcile":"ziti/ziti-router-transport","ts":"2023-10-24T03:32:15Z"}
{"caller":"service_controller.go:103","controller":"ServiceReconciler","end reconcile":"ziti/ziti-router-transport","level":"info","ts":"2023-10-24T03:32:15Z"}
{"caller":"service_controller.go:60","controller":"ServiceReconciler","level":"info","start reconcile":"ziti/ziti-router-edge","ts":"2023-10-24T03:32:15Z"}
{"caller":"service.go:135","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-10-24T03:32:15Z"}
{"caller":"service_controller.go:100","controller":"ServiceReconciler","endpoints":"{\"Type\":0,\"EpVal\":null,\"SlicesVal\":null}","event":"failed to handle service","level":"error","name":"ziti/ziti-router-edge","service":"{\"kind\":\"Service\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"ziti-router-edge\",\"namespace\":\"ziti\",\"uid\":\"f6ad97b5-5255-4588-9d8f-7ba1b726ae45\",\"resourceVersion\":\"13649389\",\"creationTimestamp\":\"2023-10-24T03:32:15Z\",\"labels\":{\"app.kubernetes.io/instance\":\"ziti-router\",\"app.kubernetes.io/managed-by\":\"Helm\",\"app.kubernetes.io/name\":\"ziti-router\",\"app.kubernetes.io/version\":\"0.29.0\",\"helm.sh/chart\":\"ziti-router-0.6.0\"},\"annotations\":{\"meta.helm.sh/release-name\":\"ziti-router\",\"meta.helm.sh/release-namespace\":\"ziti\"},\"managedFields\":[{\"manager\":\"helm\",\"operation\":\"Update\",\"apiVersion\":\"v1\",\"time\":\"2023-10-24T03:32:15Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:meta.helm.sh/release-name\":{},\"f:meta.helm.sh/release-namespace\":{}},\"f:labels\":{\".\":{},\"f:app.kubernetes.io/instance\":{},\"f:app.kubernetes.io/managed-by\":{},\"f:app.kubernetes.io/name\":{},\"f:app.kubernetes.io/version\":{},\"f:helm.sh/chart\":{}}},\"f:spec\":{\"f:allocateLoadBalancerNodePorts\":{},\"f:externalTrafficPolicy\":{},\"f:internalTrafficPolicy\":{},\"f:ports\":{\".\":{},\"k:{\\\"port\\\":8443,\\\"protocol\\\":\\\"TCP\\\"}\":{\".\":{},\"f:name\":{},\"f:port\":{},\"f:protocol\":{},\"f:targetPort\":{}}},\"f:selector\":{},\"f:sessionAffinity\":{},\"f:type\":{}}}}]},\"spec\":{\"ports\":[{\"name\":\"edge\",\"protocol\":\"TCP\",\"port\":8443,\"targetPort\":3022,\"nodePort\":30558}],\"selector\":{\"app.kubernetes.io/component\":\"ziti-router\",\"app.kubernetes.io/instance\":\"ziti-router\",\"app.kubernetes.io/name\":\"ziti-router\"},\"clusterIP\":\"10.43.30.106\",\"clusterIPs\":[\"10.43.30.106\"],\"type\":\"LoadBalancer\",\"sessionAffinity\":\"None\",\"externalTrafficPolicy\":\"Cluster\",\"ipFamilies\":[\"IPv4\"],\"ipFamilyPolicy\":\"SingleStack\",\"allocateLoadBalancerNodePorts\":true,\"internalTrafficPolicy\":\"Cluster\"},\"status\":{\"loadBalancer\":{}}}","ts":"2023-10-24T03:32:15Z"}
{"caller":"service_controller.go:101","controller":"ServiceReconciler","end reconcile":"ziti/ziti-router-edge","level":"info","ts":"2023-10-24T03:32:15Z"}

I even tried deleting the MetalLB controller pod and recreating the router install, but the same error occurred.

It seems that the relevant error is "failed to handle service" from MetalLB. I have never experienced this problem before.

The preceding log line shows the error that seems to be the root cause: "no available IPs."
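
It's worth confirming that the pool named by the service's metallb.universe.tf/address-pool annotation exists and still has free addresses, e.g. (assuming MetalLB is installed in the usual metallb-system namespace):

kubectl -n metallb-system get ipaddresspools.metallb.io --output yaml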

The good news is that it is a bug in MetalLB, not the router chart :slight_smile:
If you are curious, I elaborated over here.


Revisiting this, I tried reinstalling the router chart just a moment ago, but the post-install pod fails.

$ kubectl -n ziti logs ziti-router-post-install-job-ffr9w
+ kubectl -n ziti get secret ziti-router-identity
+ echo 'INFO: identity secret does not exist, attempting router enrollment'
+ ziti router enroll /etc/ziti/config/ziti-router.yaml --jwt /etc/ziti/config/enrollment.jwt --verbose
INFO: identity secret does not exist, attempting router enrollment
[   0.000]   DEBUG ziti/ziti/util.LogReleaseVersionCheck: ZITI_CHECK_VERSION is not 'true'. skipping version check
[   0.083]   DEBUG ziti/router/enroll.(*RestEnroller).Enroll: JWT parsed
[   1.475]   FATAL ziti/ziti/router.enrollGw: enrollment failure: (could not obtain private key: open /etc/ziti/config/tls.key: permission denied)

This smells like a Helm Release lifecycle issue, but it's unclear precisely where things went wrong. Will you please answer some clarifying questions?

Does the same problem recur when creating a brand-new Helm Release with the router chart? You mentioned re-installing: was your command a helm upgrade of an existing Release, or do you mean that you want to re-use a Secret resource from a prior Release of the router chart with a new Release?

I see that tls.key (the private key) is not readable by the router process during the enrollment operation. During this re-install, has the effective UID or GID of the router deployment's container changed, or has the filemode or ownership of the tls.key file changed? Can you confirm that the tls.key file exists, and report its ownership and filemode?
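
For instance, you can read the deployment's pod security context to see the effective UID/GID — a sketch using the chart's component label; the securityContext may be set at the pod or container level, so adjust the JSONPath accordingly:

kubectl -n ziti get deployment --selector app.kubernetes.io/component=ziti-router --output jsonpath='{.items[0].spec.template.spec.securityContext}'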

Of course!

I uninstalled the router with this:

helm -n ziti uninstall ziti-router

And then reinstalled it with this:

helm upgrade --install ziti-router openziti/ziti-router --namespace ziti --create-namespace --values values-router.yaml --set-file enrollmentJwt=/tmp/router.jwt

There are no remaining pods following the helm uninstall.

This is the status of the pods after installing:

ziti-router-867c94dfd4-qc4pl         0/1     ContainerCreating   0             9s
ziti-router-post-install-job-btsp2   0/1     Error               0             9s

Since the post-install-job pod has already errored out and completed, I cannot shell into it to do any verification.

Do you have something I should try?

Please scale down the router deployment so there will be no competition for the volume.

kubectl -n ziti scale deployment --selector app.kubernetes.io/component=ziti-router --replicas=0

Then mount the volume in a temporary pod with kubectl create --filename ./debug.yaml, where debug.yaml contains:

apiVersion: v1
kind: Pod
metadata:
  name: debug
  namespace: ziti
spec:
  containers:
  - image: busybox
    command: [ "/bin/sh", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
    name: busybox
    resources: {}
    volumeMounts:
    - mountPath: /mnt
      name: ziti-router-volume
  restartPolicy: Always
  serviceAccount: default
  serviceAccountName: default
  volumes:
  - name: ziti-router-volume
    persistentVolumeClaim:
      claimName: ziti-router

Inspect the filesystem permissions.

❯ kubectl exec -ti pod/debug -- sh
# ls -lA /mnt
total 12
-rw-------    1 2171     2171          2436 Nov 10 22:16 ca.crt
-rw-------    1 2171     2171          1419 Nov 10 22:16 client.crt
-rwxr-xr-x    1 root     root             0 Nov 10 22:16 enrollment.jwt
-rw-------    1 2171     2171          1419 Nov 10 22:16 tls.crt
-rwxr-xr-x    1 root     root             0 Nov 10 22:16 tls.key
-rwxr-xr-x    1 root     root             0 Nov 10 22:16 ziti-router.yaml

The router runs as UID 2171, and the error you reported was that the router couldn't read tls.key, so hopefully the POSIX filemodes reveal the problem.

Did you try installing a fresh router and subsequently running the same upgrade command? I'm assuming this has something to do with differences between router chart versions, or the underlying container images they use.

EDIT: I remembered that you already answered this question about installing a fresh router release. Yes, you're doing a fresh install, so it's the initial install, not a release upgrade, that's failing. That's why we're seeing the problem in the post-install hook, not the post-upgrade hook.

The point in the script where it's failing only runs when the script has detected that the identity secret doesn't exist yet; in that case it tries to create tls.key by running ziti router enroll.
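
In other words, the hook's logic amounts to something like this paraphrase of the logged commands above (not the chart's actual script):

if ! kubectl -n ziti get secret ziti-router-identity; then
  echo 'INFO: identity secret does not exist, attempting router enrollment'
  ziti router enroll /etc/ziti/config/ziti-router.yaml \
    --jwt /etc/ziti/config/enrollment.jwt --verbose
fi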

That section of the code opens the file path to write the generated private key, so perhaps a filesystem permission problem is preventing that. I haven't replicated it with minikube, but perhaps your volumes have a different default permission.

I tried the debug pod, but it hangs waiting for the router volume:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  27s   default-scheduler  0/1 nodes are available: persistentvolumeclaim "ziti-router" not found. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

You were correct!
Somehow the router and the PV/PVC I had assigned to it had gotten out of sync.
I uninstalled the router, deleted and recreated the hostPath PV/PVC pair, and then reinstalled.
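
For reference, a minimal hostPath PV/PVC pair along these lines — the claim name matches the ziti-router claim from the debug pod above, while the path and size are placeholders:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: ziti-router
spec:
  storageClassName: ""          # bind statically, bypassing any default StorageClass
  capacity:
    storage: 1Gi                # placeholder size
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/ziti-router     # placeholder path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ziti-router
  namespace: ziti
spec:
  storageClassName: ""
  volumeName: ziti-router
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi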

Everything is up now!
Thank you so much!

Of course, now, I have more queries :slight_smile:

I have created a device identity for an Android handheld.
In the Ziti Mobile Edge app, I pressed the "Tap to Connect" power button.
It flashes a green tick mark, vibrates, and then the green instantly goes away.

Also, given that I have Traefik handling requests via DNS, what is the procedure for creating a service in the console that would intercept a request for something like service.domain.com and point it to the EXTERNAL-IP address of the Traefik service?


You want to use a Ziti service on Android. You've installed Ziti Mobile Edge, but it doesn't start. Let's fix this first, because it's a prerequisite to using the service on Android.

Were you able to add the Ziti Identity you created for the handheld to the Ziti Mobile Edge (ZME) app? If so, that means the controller's client API is reachable by ZME. If not, that's the first thing to fix.

In Android, Ziti plugs in like a VPN, so there may be competition with any other VPNs you have running at the same time. In my experience, Android has only allowed me to activate one VPN at a time. Please ensure Ziti is the only VPN-like thing running during troubleshooting.

You can use the Ziti app to generate a log bundle with the "feedback" menu. You might be able to see the problem right away by inspecting the contents. You may wish to send me that log bundle for further analysis at help@openziti.org or via a private message here in Discourse.

Here's what should happen when you tap that button after successfully adding an Identity in the app.

  1. ZME calls the Ziti controller's client API. This API is exposed on the controller pod as a cluster Service named like ziti-controller-client. I believe you're using type: LoadBalancer to expose services with Traefik, so that service should have an external IP assigned by Traefik (or was it MetalLB?). If you were instead exposing the service as an Ingress then you'd see the IP address there, not in the list of Services.
  2. ZME negotiates a session with the client API by authenticating with the certificate it received during enrollment.
  3. ZME requests a list of authorized Ziti Services and Routers.
  4. ZME configures Android to "intercept" destinations matching authorized Ziti Service addresses.
  5. When another app connects to an intercepted address (domain name or IP), ZME dials its authorized Ziti Routers to create a circuit. This is the next place things might go wrong with your cluster setup. ZME needs to be able to reach the Router's "edge" listener, which appears in the list of K8s Services with a name like ziti-router-edge. Just like the controller's client API, the edge service too must be exposed on an external address (see the quick check after this list).
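
A quick way to confirm both exposure prerequisites at once — service names taken from your earlier output; both should show an EXTERNAL-IP:

kubectl -n ziti get svc ziti-controller-client ziti-router-edge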

Good luck!