Controller stuck on bootup

Hello, and Merry Christmas
I'm new to OpenZiti. The project sounds quite interesting and I'm trying to set up a network for my homelab.
I'm following these guides:
Setting Up Oracle Cloud To Host OpenZiti (I tried the latest Ubuntu, both normal and minimal images)
Host OpenZiti Anywhere | OpenZiti
I got stuck in the expressInstall at "waiting for the controller to come online to allow the edge router to enroll" (I'm bringing this up with only an external IP, so I'm skipping export EXTERNAL_DNS="....")
I did some reading and debugging and found this forum post that seems quite similar to my problem. The main issue is that for CQDet2803 the problem magically disappeared :expressionless:
When I run ziti ops verify-network --controller-config-file $HOME/.ziti/quickstart/$(hostname)/$(hostname).yaml
I get

INFO    Verifying controller config: /home/ubuntu/.ziti/quickstart/instance-20241225-1343/instance-20241225-1343.yaml
ERROR   controller advertise address at <my_public_ip>:8440 cannot be reached.
INFO    verifying 1 web entries
INFO    verifying 1 web bindPoints
ERROR   web entry[client-management], bindPoint[0] address at <my_public_ip>:8441 cannot be reached.

ERROR   One or more error. Review the output above for errors.

Running netstat -ano | grep 844 | grep LIST
I get

tcp6       0      0 :::8441                 :::*                    LISTEN      off (0.00/0/0)
tcp6       0      0 :::8440                 :::*                    LISTEN      off (0.00/0/0)

I assume it's an IPv4 vs IPv6 issue, and I hope there is a way to force ziti to use IPv4, but through my googling I couldn't find anything.
I checked the $HOME/.ziti/quickstart/$(hostname)/$(hostname).yaml file, as I assume this is the main config file. All the addresses are in IPv4 format, so I have either 0.0.0.0:<port> or <my_public_ip>:<port>.
If I run ifconfig I can see both IPv4 and IPv6 addresses. My next step was to completely disable IPv6 on the machine (I don't think this is the right solution, but I'm a bit desperate; I also cut myself off from the instance on my first attempt😅)
Any help would be highly appreciated

Hi @lex529, Merry Christmas to you as well and welcome to the community and to OpenZiti (and zrok/BrowZer)!

Yes that looks like an IPv6 issue. The listener is exclusively listening on TCP6, ::.

This is very strange. You're saying the machine HAS an IPv4 address? You can see a locally defined IPv4 address?

Can you share the two sections:

specifically the listener section:

ctrl:
  options:
    advertiseAddress: tls:ec2-3-18-113-172.us-east-2.compute.amazonaws.com:8440
  listener:             tls:0.0.0.0:8440

and the 'interface' section of the web

web:
  - name: new-address
    bindPoints:
      - interface: 0.0.0.0:8441

Are EITHER of those the advertised address, or are both of them 0.0.0.0? Does the controller log any errors (in $HOME/.ziti/quickstart/$(hostname)/$(hostname).log)?

I've never seen the ipv4 bind fail like this. :confused:

This is very strange. You're saying the machine HAS an IPv4 address? You can see a locally defined IPv4 address?

Yes, I'm SSHing into the machine via its IPv4 address, and I can see both IPv4 and IPv6 when I run ifconfig

$ ifconfig
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.0.201.163  netmask 255.255.248.0  broadcast 10.0.207.255
        inet6 fe80::200:17ff:fe02:d0f0  prefixlen 64  scopeid 0x20<link>
        ether 00:00:17:02:d0:f0  txqueuelen 1000  (Ethernet)
        RX packets 131168  bytes 396051969 (396.0 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 100176  bytes 41194310 (41.1 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 2758  bytes 527405 (527.4 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2758  bytes 527405 (527.4 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

The subnet on OCI isn't configured for IPv6.

Can you share the two sections:

# the endpoint that routers will connect to the controller over.
ctrl:
  options:
    advertiseAddress: tls:<my_public_ip>:8440
  # (optional) settings
  # set the maximum number of connect requests that are buffered and waiting to be acknowledged (1 to 5000, default 1)
  #maxQueuedConnects:      1
  # the maximum number of connects that have  begun hello synchronization (1 to 1000, default 16)
  #maxOutstandingConnects: 16
  # the number of milliseconds to wait before a hello synchronization fails and closes the connection (30ms to 60000ms, default: 5000ms)
  #connectTimeoutMs:       5000
  listener:             tls:0.0.0.0:8440
web:
  # name - required
  # Provides a name for this listener, used for logging output. Not required to be unique, but is highly suggested.
  - name: client-management
    # bindPoints - required
    # One or more bind points are required. A bind point specifies an interface (interface:port string) that defines
    # where on the host machine the webListener will listen and the address (host:port) that should be used to
    # publicly address the webListener(i.e. mydomain.com, localhost, 127.0.0.1). This public address may be used for
    # incoming address resolution as well as used in responses in the API.
    bindPoints:
      #interface - required
      # A host:port string on which network interface to listen on. 0.0.0.0 will listen on all interfaces
      - interface: 0.0.0.0:8441
        # address - required
        # The public address that external incoming requests will be able to resolve. Used in request processing and
        # response content that requires full host:port/path addresses.
        address: <my_public_ip>:8441

Does the controller log any errors

No errors; there are 2 warnings at the beginning

[   0.074] WARNING ziti/controller/config.LoadConfig: this environment is using a default generated trust domain [spiffe://<some_guid>], it is recommended that a trust domain is specified in configuration via URI SANs or the 'trustDomain' field
[   0.075] WARNING ziti/controller/config.LoadConfig: this environment is using a default generated trust domain [spiffe://<some_guid>], it is recommended that if network components have enrolled that the generated trust domain be added to the configuration field 'additionalTrustDomains' array when configuring a explicit trust domain

But at the end it says that the server is listening on 0.0.0.0 (all interfaces)

INFO xweb/v2.(*Server).Start: starting ApiConfig to listen and serve tls on 0.0.0.0:8441 for server client-management with APIs: [edge-management edge-client fabric]
[   2.376]    INFO ziti/controller/network.(*Network).Run: started

Maybe I'm a bit too paranoid in removing the public IP from the config/logs; if it's needed I can provide it.
I've also double-checked that it's the right IP. I copied the IP from the express install script

waiting for the controller to come online to allow the edge router to enroll
waiting for https://<my_public_ip>:8441

and searched the $(hostname).log and $(hostname).yaml files to be 100% sure it's the same. I also used the same value from the clipboard to SSH in from a new terminal.

What happens if you replace the public IP in your two config sections with 0.0.0.0, do you get an IPv4 listener after that? That's something different between your config and mine. I wonder if somehow that public IP is the issue?

Could you try that, maybe?

I changed the IP in the 2 places, but when I tried to run the controller with ziti controller run $(hostname).yaml, I got a panic

 INFO channel/v3.(*UnderlayDispatcher).Run: started
panic: error validating ApiConfig binding edge-client: could not find [edge.api.address] value [<my_public_ip>:8441] as a bind point any instance of ApiConfig [edge-client]

Afterwards I went into the .yaml file and changed the following section

edge:
  # This section represents the configuration of the Edge API that is served over HTTPS
  api:
    #(optional, default 90s) Alters how frequently heartbeat and last activity values are persisted
    # activityUpdateInterval: 90s
    #(optional, default 250) The number of API Sessions updated for last activity per transaction
    # activityUpdateBatchSize: 250
    # sessionTimeout - optional, default 30m
    # The number of minutes before an Edge API session will time out. Timeouts are reset by
    # API requests and connections that are maintained to Edge Routers
    sessionTimeout: 30m
    # address - required
    # The default address (host:port) to use for enrollment for the Client API. This value must match one of the addresses
    # defined in this Controller.WebListener.'s bindPoints.
    # address: <my_public_ip>:8441
    address: 0.0.0.0:8441

Now, if I run cat "$(hostname).yaml" | grep "$(curl -s eth0.me)", the only places where the IP appears uncommented are in the cert/server_cert/key values.

Changing edge.api.address got the controller running again with the following output

[   1.918]    INFO channel/v3.(*UnderlayDispatcher).Run: started
[   2.485]    INFO xweb/v2.(*Server).Start: starting ApiConfig to listen and serve tls on 0.0.0.0:8441 for server client-management with APIs: [edge-management edge-client fabric]
[   2.486]    INFO ziti/controller/network.(*Network).Run: started

But when I run netstat -ano | grep 844 | grep LIST
I still get the bind on tcp6

tcp6       0      0 :::8441                 :::*                    LISTEN      off (0.00/0/0)
tcp6       0      0 :::8440                 :::*                    LISTEN      off (0.00/0/0)

I've checked the full netstat output. One thing looked a bit strange (at least from the perspective of a noob): the SSH connections are also listed as tcp6, but with IPv4 addresses

tcp6       0    724 10.0.201.163:22         <my_home_pc_ip>:61628     ESTABLISHED on (0.18/0/0)
tcp6       0      0 10.0.201.163:22         <my_home_pc_ip>:61656     ESTABLISHED keepalive (4365.43/0/0)

After a small break, and getting bored/pissed off😅, based on the above observation I went in and changed all the references to the IPs, so I changed everything from 0.0.0.0 -> 10.0.201.163 (my private IP) and started the controller again

[   1.959]    INFO channel/v3.(*UnderlayDispatcher).Run: started
[   2.467]    INFO xweb/v2.(*Server).Start: starting ApiConfig to listen and serve tls on 10.0.201.163:8441 for server client-management with APIs: [edge-management edge-client fabric]
[   2.469]    INFO ziti/controller/network.(*Network).Run: started

I ran netstat -ano | grep 844 | grep LIST

tcp        0      0 10.0.201.163:8440       0.0.0.0:*               LISTEN      off (0.00/0/0)
tcp        0      0 10.0.201.163:8441       0.0.0.0:*               LISTEN      off (0.00/0/0)

I think this is progress and I had a small moment of joy😆, but still nothing works (I tried making a request to https://<public_ip>:8441 and adding the admin console). I'm running Ubuntu 24.04.1 LTS.
Could it have to do with the default behavior of the OS when it comes to how it listens for connections?

The only place in your config file an IP might be referenced is in the advertised section. The other sections should all remain as 0.0.0.0 until we work through what's happening.

I reread the post today. You mentioned OCI and Ubuntu 24.04. Could firewalld somehow be in this mix and blocking ports, or perhaps selinux is somehow enabled and interesting? Are you using an Arm-based distro or x64?

That's right. As I get 2 free instances, I went and made a new instance using Canonical-Ubuntu-22.04-Minimal. I had a deployment script that I was using, as I had redeployed the instance a couple of times (so it's exactly what I tried to execute on the 24.04 instance). When I used 22.04 the thing came up instantly. I managed to demo a tunnel and all that.

The firewall was not the issue. I did as advised in the guide:

sudo firewall-cmd --zone=public --add-port=8440/tcp --permanent
sudo firewall-cmd --zone=public --add-port=8441/tcp --permanent
sudo firewall-cmd --zone=public --add-port=8442/tcp --permanent
sudo firewall-cmd --zone=public --add-port=8443/tcp --permanent
sudo firewall-cmd --zone=public --add-port=10080/tcp --permanent

Also, on boot-up the controller checks whether the ports are open, and that worked.
If you are thinking about the Security List that is configured at the vnet level in OCI, that is not the cause, as I'm using the same vnet for both instances.

or perhaps selinux is somehow enabled and interesting?

No clue, I haven't touched anything in regard to this. I would just boot a new instance and basically paste all the lines that I followed from the 2 guides.

x64

I still have both instances up and running. If you have any other ideas I can give them a try and maybe make it work on 24.04

So you're saying that Ubuntu minimal didn't work, but Ubuntu full did? Am I understanding you correctly?

That's really interesting if that's the case. I don't know what the difference would be.

No, no. I tried 24.04, both minimal and full, and it didn't work, but when I tried 22.04 (minimal), it did

Oh ok. That's definitely unexpected then. I dunno what could be different between those OS versions to cause that sort of behavior. I'll see what we can learn about this next week sometime. Cheers

Sadly, I am not able to reproduce this issue.

I used Canonical Ubuntu 20.04 and Image 2024.10.02-0 with a 2cpu AMD machine.

One thing that DID happen to me, as I followed the instructions on the blog, is firewalld needed to be installed. So if you copy and paste the block as shown at Setting Up Oracle Cloud To Host OpenZiti -- it DOES NOT update firewalld... you have to do the "install" and "authorize" in two separate steps.

Could you be getting bitten by that same issue I did?

Can you run sudo firewall-cmd --list-all ?

sudo firewall-cmd --list-all
public
  target: default
  icmp-block-inversion: no
  interfaces:
  sources:
  services: dhcpv6-client ssh
  ports: 8440/tcp 8441/tcp 8442/tcp 8443/tcp 10080/tcp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

I'm having exactly the same issue. I tried different versions of Ubuntu and Debian.
The port is bound as type tcp6 every time

Any news regarding the cause of this issue and maybe how to solve it?

Hi @Rewan - welcome to the community and to OpenZiti (and zrok/BrowZer),

Can you tell me the exact steps you're using? When the controller is stuck waiting, can you run ss -lntp and show me the output (and maybe netstat -ano | grep 844 as well just to confirm)?

I can't reproduce it, so I can't troubleshoot it :frowning:

If you want to get on a Zoom and debug live - I can facilitate. That might be easier... I could DM you a link if you want to go that way?

Output of ss -lntp:

LISTEN 0      4096   127.0.0.53%lo:53         0.0.0.0:*    users:(("systemd-resolve",pid=794,fd=13))
LISTEN 0      128          0.0.0.0:22         0.0.0.0:*    users:(("sshd",pid=890,fd=3))
LISTEN 0      128             [::]:22            [::]:*    users:(("sshd",pid=890,fd=4))
LISTEN 0      4096               *:8440             *:*    users:(("ziti",pid=1832,fd=9))
LISTEN 0      4096               *:8441             *:*    users:(("ziti",pid=1832,fd=10))

Output of netstat:

tcp6       0      0 :::8440                 :::*                    LISTEN      off (0.00/0/0)
tcp6       0      0 :::8441                 :::*                    LISTEN      off (0.00/0/0)
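
A note on reading the two outputs above. As far as I can tell from recent iproute2 versions, ss prints the local address as *:port for an IPv6 wildcard socket that also accepts IPv4 (IPV6_V6ONLY off), [::]:port for an IPv6-only wildcard socket, and 0.0.0.0:port for an IPv4-only one, while netstat shows the first two cases alike as tcp6 with :::port. The sketch below encodes that reading; classify_listener is a hypothetical helper, not part of ss or ziti, and older ss versions format these columns differently.

```python
def classify_listener(local_addr: str) -> str:
    """Classify the 'Local Address:Port' column of `ss -lnt`.

    Assumption: recent iproute2 formatting, where '*' means an IPv6
    wildcard socket with IPV6_V6ONLY off (i.e. dual-stack).
    """
    # Split on the last ':' so that '[::]:22' yields host '[::]'.
    host, _, _port = local_addr.rpartition(":")
    if host == "*":
        return "dual-stack wildcard (IPv4 + IPv6)"
    if host == "[::]":
        return "IPv6-only wildcard"
    if host == "0.0.0.0":
        return "IPv4-only wildcard"
    return "specific address"


# Local addresses taken from the ss output in this thread.
for addr in ["0.0.0.0:22", "[::]:22", "*:8440", "*:8441", "127.0.0.53%lo:53"]:
    print(f"{addr:18} {classify_listener(addr)}")
```

Under that assumption, the two ziti listeners here would be dual-stack, while sshd holds separate IPv4-only and IPv6-only sockets.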

Sure that would be great

Today, @Rewan and I (and @qrkourier) hopped on a call. Turns out that when a listener is bound to both tcp4 AND tcp6 (a dual-stack socket), Linux tools such as netstat may report it only as tcp6... So although it appeared that the controller was not bound to 0.0.0.0, it actually was.

We were able to verify the controller was indeed online and running by issuing a curl when ssh'ed to the machine, in this case:

curl -sk https://127.0.0.1:8441

This curl returned JSON, indicating the server was bound and listening properly. There must be some sort of firewall/NAT in between that's interfering with outside communications.
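
For anyone who wants to see the dual-stack behavior in isolation, here is a minimal Python sketch (not OpenZiti code). It opens a listener on the IPv6 wildcard address with IPV6_V6ONLY turned off, then connects to it over plain IPv4 loopback. The function name is made up for illustration. My assumption is that the controller's 0.0.0.0 listeners end up as this kind of socket, since Go's net.Listen with network "tcp" is documented to listen on all addresses for an unspecified host, but I have not verified that in the ziti source.

```python
import socket


def ipv4_reaches_v6_wildcard():
    """Listen on [::] with IPV6_V6ONLY off, then connect to it over IPv4.

    Returns True if the IPv4 client is accepted and shows up as an
    IPv4-mapped IPv6 peer, or None if this host cannot set up the socket.
    """
    try:
        srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    except OSError:
        return None  # no IPv6 socket support on this host
    try:
        # 0 = also accept IPv4. This is the usual Linux default
        # (sysctl net.ipv6.bindv6only=0); it is set explicitly here.
        srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
        srv.bind(("::", 0))  # port 0: let the OS pick a free port
        srv.listen(1)
        srv.settimeout(2)
        port = srv.getsockname()[1]
        # Connect with an IPv4 address even though the listener is "tcp6".
        with socket.create_connection(("127.0.0.1", port), timeout=2):
            conn, peer = srv.accept()
            conn.close()
        # An IPv4 client appears as an address like ::ffff:127.0.0.1
        return peer[0].startswith("::ffff:")
    except OSError:
        return None
    finally:
        srv.close()


print(ipv4_reaches_v6_wildcard())
```

While such a listener is open, netstat lists it as tcp6 on :::port even though IPv4 clients can connect, which matches what we saw on the call. It would also explain the established SSH sessions shown earlier as tcp6 with IPv4 addresses.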

Thanks @Rewan for hopping on a call and diagnosing the issue.

@lex529 -- FYI - can you try that curl as well? This information, along with me not being able to reproduce, makes me think this is somewhere/somehow firewall related (not OpenZiti).
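
If curl isn't handy, the same inside-versus-outside comparison can be sketched in Python. This is only a plain TCP connect check, not how ziti ops verify-network works internally; tcp_reachable and diagnose are hypothetical helper names. The idea is to run tcp_reachable("127.0.0.1", 8441) on the controller host and tcp_reachable("<my_public_ip>", 8441) from a different machine, then compare the two results.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def diagnose(local_ok: bool, public_ok: bool) -> str:
    """Interpret a loopback check (on the host) and a public-IP check (from outside)."""
    if not local_ok:
        return "nothing is listening locally: check the controller process and its log"
    if not public_ok:
        return "listener is up, but something in between blocks it: check host firewall, NAT, and cloud security rules"
    return "reachable both locally and from outside"


# Example interpretation of the situation in this thread:
# the local curl worked, but the public address could not be reached.
print(diagnose(local_ok=True, public_ok=False))
```

A local success combined with an outside failure points away from the controller's bind address and toward firewalld/iptables or the cloud provider's rules.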

@TheLumberjack, I can't do much about this. My OCI account got terminated and no reason was provided. I'm moving the whole setup to Azure now.
If you want to look into it more, I can provide the Terraform for setting up everything.

No worries. I actually think things might have been working before anyway; it was just obscured by the way Linux reports dual-stack listeners and by firewalld/iptables/the cloud provider firewall.

Hopefully your Azure experience will be smoother!