Ziti Edge Router can't connect after some time (Docker Compose)

Hello, I’ve encountered a possible bug while playing with the Docker Compose setup.

It seems that the Ziti Edge Router enrolls correctly the first few times, but after some hours, running docker compose down followed by docker compose up -d leaves the edge router unable to connect.

At first I thought it was something that I was doing wrong, but now I’m not so sure since I could reproduce it with only the Ziti Controller and the Ziti Edge Router.

Here are my files. These files are stored in a folder called zititest. There are no other running containers. Ports are open as expected and first connections work correctly.

As a small disclaimer, I made a small modification to the Docker Compose file: the containers have a healthcheck, which is just a pretty dumb check for listening ports. This way Docker won’t start the edge router container until the controller is listening on the expected ports.
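For reference, the port check used in the healthchecks below can be sketched as a small shell helper. This is only an illustration: the function name `is_listening` is made up here, and it reads `lsof -i -P -n` output on stdin so it can be tried without a running container. Unlike the plain grep in the compose file, this variant only matches sockets in the LISTEN state.

```shell
# Hypothetical helper (not part of the quickstart image).
# Reads `lsof -i -P -n` output on stdin and succeeds only if some
# TCP socket is listening on the given port.
is_listening() {
  port="$1"
  grep -Eq "TCP .*:${port} \(LISTEN\)"
}

# Possible use inside a container that has lsof installed:
#   lsof -i -P -n | is_listening 6262 && echo "controller port is up"
```

The stricter match is a design choice, not something the original healthcheck needs: it avoids counting an established connection to the port as "listening".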

docker-compose.yml
services:
  ziti-controller:
    image: "${ZITI_IMAGE}:${ZITI_VERSION}"
    ports:
      - ${ZITI_EDGE_CONTROLLER_PORT:-1280}:${ZITI_EDGE_CONTROLLER_PORT:-1280}
      - ${ZITI_CTRL_PORT:-6262}:${ZITI_CTRL_PORT:-6262}
    environment:
      - ZITI_EDGE_IDENTITY_ENROLLMENT_DURATION=${ZITI_EDGE_IDENTITY_ENROLLMENT_DURATION}
      - ZITI_EDGE_ROUTER_ENROLLMENT_DURATION=${ZITI_EDGE_ROUTER_ENROLLMENT_DURATION}
    env_file:
      - $MAIN_DIR/.env
    networks:
      ziti:
        aliases:
          - ziti-edge-controller
    volumes:
      - ziti-fs:/persistent
    entrypoint:
      - "/var/openziti/scripts/run-controller.sh"
    healthcheck:
      test: ["CMD", "bash", "-c", "lsof -i -P -n | grep -q 'TCP.*:6262' && lsof -i -P -n | grep -q 'TCP.*:1280'"]
      interval: 10s
      timeout: 5s
      retries: 10

  ziti-controller-init-container:
    image: "${ZITI_IMAGE}:${ZITI_VERSION}"
    depends_on:
      ziti-controller:
        condition: service_healthy
    environment:
      - ZITI_CONTROLLER_RAWNAME="${ZITI_CONTROLLER_RAWNAME}"
      - ZITI_EDGE_CONTROLLER_RAWNAME="${ZITI_EDGE_CONTROLLER_RAWNAME}"
    env_file:
      - $MAIN_DIR/.env
    networks:
      ziti:
        aliases:
          - ziti-edge-controller-init-container
    volumes:
      - ziti-fs:/persistent
    entrypoint:
      - "/var/openziti/scripts/run-with-ziti-cli.sh"
    command:
      - "/var/openziti/scripts/access-control.sh"

  ziti-edge-router:
    image: "${ZITI_IMAGE}:${ZITI_VERSION}"
    depends_on:
      ziti-controller:
        condition: service_healthy
      ziti-controller-init-container:
        condition: service_completed_successfully
    environment:
      - ZITI_CONTROLLER_RAWNAME="${ZITI_CONTROLLER_RAWNAME}"
      - ZITI_EDGE_CONTROLLER_RAWNAME="${ZITI_EDGE_CONTROLLER_RAWNAME}"
      - ZITI_EDGE_ROUTER_RAWNAME=${ZITI_EDGE_ROUTER_RAWNAME:-ziti-edge-router}
      - ZITI_EDGE_ROUTER_ROLES=public
    env_file:
      - $MAIN_DIR/.env
    ports:
      - ${ZITI_EDGE_ROUTER_PORT:-3022}:${ZITI_EDGE_ROUTER_PORT:-3022}
    networks:
      - ziti
    volumes:
      - ziti-fs:/persistent
    entrypoint: /bin/bash
    command: "/var/openziti/scripts/run-router.sh edge"
    healthcheck:
      test: ["CMD", "bash", "-c", "lsof -i -P -n | grep -q 'TCP.*:3022'"]
      interval: 10s
      timeout: 5s
      retries: 10

networks:
  ziti:

volumes:
  ziti-fs:

.env
# Generic
MAIN_DIR=/home/ubuntu/zititest

# OpenZiti Variables
ZITI_IMAGE=openziti/quickstart
ZITI_VERSION=0.27.9

# The duration of the enrollment period (in minutes), default if not set
# shown - 7days
ZITI_EDGE_IDENTITY_ENROLLMENT_DURATION=10080
ZITI_EDGE_ROUTER_ENROLLMENT_DURATION=10080

# controller address/port information
ZITI_CONTROLLER_RAWNAME=ziti-controller
ZITI_CONTROLLER_HOSTNAME=mymachine.mydomain.com

ZITI_EDGE_CONTROLLER_RAWNAME=ziti-edge-controller
ZITI_EDGE_CONTROLLER_HOSTNAME=mymachine.mydomain.com
ZITI_EDGE_CONTROLLER_IP_OVERRIDE=11.22.33.44

# router address/port information
ZITI_EDGE_ROUTER_RAWNAME=mymachine.mydomain.com
ZITI_EDGE_ROUTER_IP_OVERRIDE=11.22.33.44

Upon first running docker compose up -d, the containers start nicely. You can check the logs here (some logs are big, so I am using pastebin; hope that’s okay):
Ziti Controller logs
Ziti Init Controller Container logs
Ziti Edge Router logs

I left them running for a while. I’ve made no connections at all, nor have I created any identities. All with just the defaults.

Then I performed the following
docker compose down at 13:28
docker compose up -d at 13:30 → logs looked good
docker compose down at 13:50
docker compose up -d at 13:51 → logs looked good

And then I left it running for a couple hours, and I ran
docker compose down at 16:02
docker compose up -d at 16:03

And then the following happened:
Ziti Controller logs (check errors at the end)
Ziti Init Controller Container Logs (adding them just for completion, but no relevant info here I think)
Ziti Edge Router logs

It seems like the edge router failed executing some Ziti command because the request got Unauthorized.
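A quick way to spot this after each restart might be to scan the router logs for the word "Unauthorized". The sketch below is only an assumption about what to look for, based on the description above; the helper name `has_auth_error` is made up.

```shell
# Hypothetical helper: succeeds if the text on stdin mentions an
# "Unauthorized" response (case-insensitive match).
has_auth_error() {
  grep -qi 'unauthorized'
}

# Possible use after a restart cycle (service name taken from the compose file above):
#   docker compose down && docker compose up -d
#   docker compose logs ziti-edge-router | has_auth_error && echo "router hit an auth error"
```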

Could this be related to the ZITI_PWD issue we discussed in another post? Or am I doing something wrong?

Thanks in advance for your time :bowing_man:

Hi @jruiz94, thanks for the very detailed issue report; that’s extremely helpful. @TheLumberjack and I were just looking at this. He had a compose env that had been running for 13 days; he brought it down and back up and saw the same issue. We think it’s related to the CLI’s cached credentials in conjunction with the ZITI_PWD issue you’re familiar with.

I’m taking a deeper look into this now to figure out the root cause and solution.

I’m also going to be pushing a new docker image that resolves the ZITI_PWD issue.

That sounds fantastic, thanks! I’ll wait for the updates then :smiley:

Hi again @gberl002 , I just saw the Docker Image got updated. First of all, thanks for the quick update :slight_smile: Though I noticed that the image version didn’t change, it’s still 0.27.9, wouldn’t it be better to have 0.27.10 or any different version to differentiate the one with the bug and the one without it? I’m just thinking out loud, it doesn’t affect me but I thought it was strange.

Anyways, thanks for the update on the ZITI_PWD issue. I’ll be waiting for updates on the issue I mentioned in the original message. By any chance, do you have a GitHub ticket or issue that I can subscribe to for updates on that? Thanks in advance

Ah, you caught me, I made a mistake and published the “dev” branch.

I could republish the release branch, but while it’s there, would you be willing to pull the latest and try running your scenario again? You may have to delete the image you have and pull the latest. Or, if you want to save it for some reason, you could retag it temporarily.
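If you want to keep the old image around, the retag-then-pull step could look roughly like this. It is a sketch only: `run`, `refresh_image`, and the `-backup` suffix are made-up names, while `docker tag` and `docker pull` are the standard Docker commands. Setting `DRY_RUN` prints the commands instead of running them.

```shell
# Print commands instead of running them when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: keep the current image under a backup tag,
# then pull whatever is now published under the original tag.
refresh_image() {
  image="$1"   # e.g. openziti/quickstart
  tag="$2"     # e.g. 0.27.9
  run docker tag "${image}:${tag}" "${image}:${tag}-backup"
  run docker pull "${image}:${tag}"
}

# Example (dry run): DRY_RUN=1 refresh_image openziti/quickstart 0.27.9
```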

Regarding my bug findings:
It’s a strange situation: I’m able to reproduce the issue on Linux but not on Mac, and it is definitely partly due to the ZITI_PWD bug. However, if you look in the second router log, you’ll notice it says:

----------  Creating edge-router mymachine.mydomain.com
...auth errors...
---------- Enrolling edge-router mymachine.mydomain.com

This should not be happening since the run-router.sh script looks for an existing config and skips the router creation and enrollment if an existing config is found. There is some other strangeness surrounding this issue but without delving deeper into details, the issue is resolved with the ZITI_PWD fix. So, again, if you could check this out for yourself that would be much appreciated.
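To illustrate the intended behavior, the guard is roughly "only create and enroll the router when no config file exists yet". The sketch below is not the actual run-router.sh; the function name and the example path are made up for illustration.

```shell
# Hypothetical illustration of the "skip if already configured" guard.
# Succeeds (exit 0) when the router still needs to be created and enrolled.
needs_enrollment() {
  config_file="$1"
  [ ! -f "$config_file" ]   # no config file => not enrolled yet
}

# Example with a made-up path:
#   if needs_enrollment /persistent/my-router.yaml; then
#     echo "creating and enrolling router"
#   else
#     echo "existing config found, starting router as-is"
#   fi
```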

Sure! I'll test it when I have some time :slight_smile:

@gberl002 Okay, so I’m currently testing. So far there are no errors in the logs after a couple of runs, but just to be sure I’ll be shutting this down and bringing it back up a few times over the next few days.

One thing I’m unsure whether we should be concerned about is that the logs seem to throw an error around here:

zititest-ziti-controller-init-container-1  | /persistent/ziti.env: line 56: alias: -y: not found
zititest-ziti-controller-init-container-1  | /persistent/ziti.env: line 57: alias: -y: not found

I’ve made a pastebin if you want to check the full logs for the first run

ooooh. I fixed that bug but I bet it’s baked into the docker image… I’m going to republish the containers from ‘main’…

“That bug” PR, if you’re interested: remove errant quote by dovholuknf · Pull Request #1097 · openziti/ziti · GitHub :slight_smile:
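For anyone curious how a stray quote can produce that `alias: -y: not found` message, here is a small reproduction. The alias name and command are hypothetical and are not the actual lines from ziti.env; the point is only that a closing quote placed too early turns `-y` into a separate argument to `alias`.

```shell
# Hypothetical alias definitions, only to show the class of error.
# Correct: the whole command, including its flag, is inside the quotes.
alias demoLogin="echo login -y"

# Broken: the closing quote comes too early, so `-y` becomes a second
# argument to `alias`, which then tries to look up an alias named "-y".
demo_broken_alias() {
  alias demoLogin="echo login" -y
}

# Calling demo_broken_alias in bash reports an error on stderr saying
# that the alias "-y" was not found.
```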

Ahahahahah this is a classic one :laughing:

Pesky computers, one single byte often matters …

I see the image 0.27.9 updated some minutes ago, but I tested and it seems to have the same zitiLogin issue, so I assume I should wait for 0.27.10?

Really? I made sure to look at main first to make sure it wouldn't... On main, there's not even a "-y" flag... https://github.com/openziti/ziti/blob/main/quickstart/docker/image/ziti-cli-functions.sh#L48

function zitiLogin {
  "${ZITI_BIN_DIR-}/ziti" edge login "${ZITI_EDGE_CTRL_ADVERTISED}" -u "${ZITI_USER-}" -p "${ZITI_PWD}" -c "${ZITI_PKI_OS_SPECIFIC}/${ZITI_EDGE_CONTROLLER_ROOTCA_NAME}/certs/${ZITI_EDGE_CONTROLLER_INTERMEDIATE_NAME}.cert"
}

I then went and did exactly what you said, but I did a docker compose pull. It seemed ok to me. I also did docker compose down -v to remove everything. Makes me think it's not the latest?

Oh, you could exec into the container and inspect the /var/openziti/scripts/ziti-cli-functions.sh to make sure it’s not using the -y flag
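One possible way to do that check is a small grep for a standalone `-y` argument. This is a sketch: the helper name `script_uses_y_flag` is made up, and the commented `docker compose exec` line assumes the service name from the compose file earlier in this thread.

```shell
# Hypothetical helper: succeeds if the given script passes a standalone
# `-y` argument anywhere.
script_uses_y_flag() {
  grep -Eq -- '(^|[[:space:]])-y([[:space:]]|$)' "$1"
}

# Possible use from the host:
#   docker compose exec ziti-controller \
#     grep -n -- ' -y' /var/openziti/scripts/ziti-cli-functions.sh
```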

Sorry I might not be explaining myself correctly.

^ What I said here is not happening anymore, that’s good :+1: ^

What I was referring to here is the issue I mentioned here.

Regarding what you sent here:

You ran the docker compose down -v command, and the -v flag removes the volumes. If you instead run docker compose up, wait for startup, and then run docker compose down (keeping the volumes), the next time you run docker compose up you won't be able to run zitiLogin inside the controller, because of the ZITI_PWD issue that we were discussing in the other thread.

I might have misunderstood @gberl002's messages, but I thought that was the issue he was referring to when he said:

Sorry, I might have not explained myself correctly :sweat_smile:

For the record, these are the commands that I'm running

# Clean start removing everything (note: this deletes ALL volumes and images on the host)
docker volume rm $(docker volume ls --quiet)
docker image rm $(docker image ls --quiet)
# Start up compose
docker compose up -d
docker exec -it zititest-ziti-edge-controller-1 bash
# (inside container)
zitiLogin   # this works
# (exit container)
docker compose down
docker compose up -d
docker exec -it zititest-ziti-edge-controller-1 bash
# (inside container)
zitiLogin   # this fails

Yep, I just checked and it seems good. I'll wait for the updates on the ZITI_PWD and the "router disconnecting after some docker compose ups/downs" issues.

Ahhh. Yeah that's the ZITI_PWD issue. I see what you mean. That's still in the process of getting fixed. :slight_smile:
