Troubleshooting starting a remote public edge router

Well.. this has been a bit of a journey of discovery on the importance of using the right port numbers. I believe I am past that now.. and am now presented with my next error to troubleshoot :slight_smile:

Setup.

Controller on compute #1
Public Edge Router on compute #2

Situation

  1. Created the config on compute #2

ziti create config router edge --routerName "${router_name}" > "${ZITI_HOME}/${router_name}.yaml"

  2. Created the public edge router jwt file

ziti edge create edge-router "${router_name}" -o "${ZITI_HOME}/${router_name}.jwt" -t -a "public"

  3. Enrolled the public edge router on compute #2

ziti-router enroll "${ZITI_HOME}/${router_name}.yaml" --jwt "${ZITI_HOME}/${router_name}.jwt" &> "${ZITI_HOME}/${router_name}.enrollment.log"

{"file":"github.com/openziti/edge@v0.22.36/router/enroll/enroll.go:205","func":"github.com/openziti/edge/router/enroll.(*RestEnroller).Enroll","level":"info","msg":"registration complete","time":"2022-07-30T14:32:27.457Z"}

  4. Started up the public edge router

ziti-router run "${ZITI_HOME}/${router_name}.yaml" > "${ZITI_HOME}/${router_name}.log"

This is the step I now have problems with: it generates an error that I don't know how to resolve.

Any tips?

[   0.665]    INFO fabric/router/forwarder.(*Faulter).run: started
[   0.665]    INFO fabric/router/forwarder.(*Scanner).run: started
[   0.667]    INFO fabric/router.(*Router).showOptions: ctrl = {"OutQueueSize":4,"MaxQueuedConnects":1,"MaxOutstandingConnects":16,"ConnectTimeout":1000000000,"DelayRxStart":false,"WriteTimeout":0}
[   0.667]    INFO fabric/router.(*Router).showOptions: metrics = {"ReportInterval":60000000000,"MessageQueueSize":10}
[   0.667]    INFO fabric/router.(*Router).initializeHealthChecks: starting health check with ctrl ping initially after 15s, then every 30s, timing out after 15s
[   0.667]    INFO fabric/router.(*Router).startXlinkDialers: started Xlink dialer with binding [transport]
[   0.667]    INFO fabric/metrics.GoroutinesPoolMetricsConfigF.func1.1: {idleTime=[10s] maxQueueSize=[1] minWorkers=[1] poolType=[pool.listener.link] maxWorkers=[16]} starting goroutine pool
[   0.667]    INFO fabric/router.(*Router).startXlinkListeners: started Xlink listener with binding [transport] advertising [tls:168.138.13.227:10080]
[   0.668]    INFO edge/router/xgress_edge.(*listener).Listen: {address=[tls:0.0.0.0:3022]} starting channel listener
[   0.668]    INFO fabric/metrics.GoroutinesPoolMetricsConfigF.func1.1: {idleTime=[10s] poolType=[pool.listener.xgress_edge] minWorkers=[1] maxWorkers=[16] maxQueueSize=[1]} starting goroutine pool
[   0.668]    INFO fabric/router.(*Router).startXgressListeners: created xgress listener [edge] at [tls:0.0.0.0:3022]
[   0.668]    INFO fabric/router.(*Router).startXgressListeners: created xgress listener [tunnel] at []
[   0.668]    INFO edge/router/xgress_edge.(*Acceptor).Run: starting
[   0.746]   FATAL ziti/ziti-router/subcmd.run: {error=[error connecting ctrl (channel synchronization)]} error starting

Details of the yaml file

ctrl:
  endpoint: tls:ip-address-of-controller:8441

link:
  dialers:
    - binding: transport
  listeners:
    - binding: transport
      bind: tls:0.0.0.0:10080
      advertise: tls:ip-address-of-2nd-compute:10080
      options:
        outQueueSize: 4

listeners:
# bindings of edge and tunnel requires an "edge" section below
  - binding: edge
    address: tls:0.0.0.0:3022
    options:
      advertise: ip-address-of-2nd-compute:3022
      connectTimeoutMs: 1000
      getSessionTimeout: 60s
  - binding: tunnel
    options:
      mode: host #tproxy|host

edge:
  csr:
    country: US
    province: NC
    locality: Charlotte
    organization: NetFoundry
    organizationalUnit: Ziti
    sans:
      dns:
        - instance-20220518-1244
        - localhost
      ip:
        - "127.0.0.1"
        - "ip-address-of-2nd-compute"

Ports open

controller
ports: 22 (ssh), 8441(controller), 8442(old edge-router)

testserver
ports: 22 (ssh), 3022(new edge-router)
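
A port list like the one above only records what each firewall is meant to allow. The following is a minimal sketch for checking, from the router host, whether an endpoint written in the `tls:host:port` form is actually reachable over TCP. It assumes bash (for the `/dev/tcp` pseudo-device) and the coreutils `timeout` command; the helper names `endpoint_host`, `endpoint_port` and `check_endpoint` are made up for illustration and are not part of the ziti CLI. IPv6 literals are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the ziti CLI.

# Print the host part of an endpoint such as tls:10.0.0.5:8441.
endpoint_host() {
  local rest="${1#*:}"        # drop the "tls:" scheme prefix
  printf '%s\n' "${rest%:*}"  # drop the trailing ":port"
}

# Print the port part of the same kind of endpoint.
endpoint_port() {
  printf '%s\n' "${1##*:}"
}

# Return 0 if a TCP connection can be opened within 3 seconds, 1 otherwise.
check_endpoint() {
  local host port
  host="$(endpoint_host "$1")"
  port="$(endpoint_port "$1")"
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "reachable: ${host}:${port}"
  else
    echo "NOT reachable: ${host}:${port}"
    return 1
  fi
}

# Example (placeholder address taken from the router yaml above):
#   check_endpoint tls:ip-address-of-controller:8441
```

A successful connect only shows that something is listening and the firewall lets the traffic through; it says nothing about which ziti listener answers on that port.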

Hi Markamind:

Can you post your controller config yaml?

You should have a corresponding ctrl entry in the controller config:

ctrl:
  listener: tls:0.0.0.0:8441

Note that this port needs to be different from the ports used under

edge:
  api:

and

web:
  bindPoints

So if 8441 is already used in those two sections, you will need to use a different one under ctrl:, e.g. 8442.

You will also need to open that port in your controller's firewall settings and set the router's yaml to match:

ctrl:
  endpoint: tls:ip-of-controller:8442 (or whatever port you chose on the controller)

Regards,

Robert


Awesome… I feel that I am getting really close to solving this now.

I may have a problem with the configuration of the controller yaml. It is an old version from an old quickstart… I have a sense it's missing some things, but as it never caused any problems… I never got around to updating it…

You will find the controller entries below

ctrl:
  listener:             tls:0.0.0.0:6262
web:
  - name: client-management
    bindPoints:
      - interface: 0.0.0.0:8441
        address: ip-of-controller:8441
    apis:
      - binding: edge-management

When I used port 6262 in the remote edge router, I had problems retrieving the ca bundle. This was resolved when I set the ctrl endpoint to port 8441.


ctrl:
  endpoint:             tls:controller-ip-address:8441

Other settings I made for the remote edge router yaml file are as shown below.

link:
  dialers:
    - binding: transport
  listeners:
    - binding:          transport
      bind:             tls:0.0.0.0:10080
      advertise:        tls:test-server-ip-address:10080
      options:
        outQueueSize:   4

listeners:
  - binding: edge
    address: tls:0.0.0.0:3022
    options:
      advertise: test-server-ip-address:3022

PS… I also added test-server-ip-address to the SANs IP address list

Markamind:

Not sure I follow your reference to the ca bundle. The router process only communicates with the controller over the port defined by

ctrl:
  listener: tls:0.0.0.0:6262

The router yaml needs to have

ctrl:
  endpoint: tls:controller-ip-address:6262

or it will never work.

When you enroll a router with

ziti enroll config.yml --jwt

you are not using the URL defined under the router's ctrl endpoint; you are using the URL embedded in the jwt.

In one of your posts you state the controller has the following ports open:
ports: 22 (ssh), 8441(controller), 8442(old edge-router)

6262 is not listed. You need to open port 6262 on your controller. Then, assuming you have already enrolled successfully, your router should connect to the controller.
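
To see which URL an enrollment token carries, the payload of the jwt can be decoded and read. The following is a minimal sketch, assuming GNU `base64`, `cut` and `tr` are available; the helper name `jwt_claims` is hypothetical, and the expectation that the controller URL sits in the `iss` (issuer) claim is an assumption, not something stated in this thread. The signature is not verified.

```shell
# Hypothetical helper: print the JSON claims of a JWT without verifying it.
jwt_claims() {
  local payload pad
  # A JWT is header.payload.signature; take the middle segment and map the
  # base64url alphabet back to the standard base64 alphabet.
  payload="$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')"
  # base64url drops the '=' padding, so restore it to a multiple of 4.
  pad=$(( (4 - ${#payload} % 4) % 4 ))
  while [ "$pad" -gt 0 ]; do
    payload="${payload}="
    pad=$((pad - 1))
  done
  printf '%s' "$payload" | base64 -d
  echo
}

# Example (file name taken from the enrollment step earlier in the thread):
#   jwt_claims "$(cat "${ZITI_HOME}/${router_name}.jwt")"
```

If the address printed there points at the edge API port rather than the ctrl listener port, that would be consistent with enrollment succeeding on one port while the router still has to connect on another.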

example controller yaml

v: 3

db: /root/.config/ziti/db/ctrl.db

identity:
  cert: /root/.config/ziti/pki/TEST1234/certs/TEST1234-ZEC-US-SANJOSE-1-0000001-client.cert
  server_cert: /root/.config/ziti/pki/TEST1234/certs/TEST1234-ZEC-US-SANJOSE-1-0000001-server.cert
  key: /root/.config/ziti/pki/TEST1234/keys/TEST1234-ZEC-US-SANJOSE-1-0000001-server.key
  ca: /root/.config/ziti/pki/TEST1234/certs/TEST1234.cert

events:
  jsonLogger:
    subscriptions:
      - type: fabric.sessions
        include:
          - created
      - type: edge.sessions
        include:
          - created
      - type: fabric.usage
    handler:
      type: file
      format: json
      path: /tmp/ziti-events.log
      maxbackups: 7

ctrl:
  listener: tls:0.0.0.0:80
  options:
    maxQueuedConnects: 50
    maxOutstandingConnects: 100
    connectTimeoutMs: 3000

mgmt:
  listener: tls:0.0.0.0:10000
  options:
    maxQueuedConnects: 50
    maxOutstandingConnects: 100
    connectTimeoutMs: 3000

edge:
  api:
    address: :443
    sessionTimeout: 30m
  enrollment:
    signingCert:
      cert: /root/.config/ziti/pki/TEST1234/certs/TEST1234.cert
      key: /root/.config/ziti/pki/TEST1234/keys/TEST1234.key
      ca: /root/.config/ziti/pki/TEST1234/certs/TEST1234.cert
    edgeIdentity:
      duration: 1440m
    edgeRouter:
      duration: 1440m

terminator:
  validators:
    edge: edge

web:
  - name: management
    bindPoints:
      address: :443
    options: { }
    apis:
      - binding: fabric
        options: { }
      - binding: edge-management
        options: { }
      - binding: edge-client
        options: { }

Corresponding example Router config

v: 3

identity:
  cert: "certs/identity.cert.pem"
  server_cert: "certs/internal.chain.cert.pem"
  key: "certs/internal.key.pem"
  ca: "certs/intermediate-chain.pem"

edge:
  csr:
    country:
    locality:
    organization:
    organizationalUnit:
    sans:
      dns:
        - "localhost"
      ip:
        - "127.0.0.1"
        - ""

ctrl:
  endpoint: tls::80

link:
  dialers:
    - binding: transport

listeners:
  - binding: edge
    address: tls:0.0.0.0:443
    options:
      advertise: :443
      maxQueuedConnects: 50
      maxOutstandingConnects: 100
      connectTimeoutMs: 3000
  - binding: tunnel
    options:
      svcPollRate: 15s
      resolver: udp://:53
      dnsSvcIpRange: 100.64.0.1/10
      lanIf: ens33

Regards,

Robert


Awesome feedback. Your feedback was very helpful. Thanks heaps :slight_smile:

In reviewing your config files, one thing I picked up on is that ports 80 and 443 are used instead of 6262, 8441, etc.

My understanding is that this prevents port inference… is that correct?

I always wondered how that was configured.

Hi Markamind:

Those are just example placeholder ports. Replace them with ports of your choosing, e.g. 6262/8441, as long as they follow the same assignment pattern.

Regards,

Robert


After a bit more playing around with this… I think I have a problem with my controller certificate.

After taking a detailed inspection of your post… all I needed to change was to make the port in the ctrl entry the same as the controller. In this case, my controller is setup to use port 6262… which in your case was port 80.

ctrl:

    endpoint: tls:controller-ip-address:6262

However… nothing is that simple for me… as I now get the following error message

“error connecting ctrl (x509: certificate is valid for 127.0.0.1, not controller-ip-address)”

I checked this out… a bit more… as on the controller… I have two public edge routers… which I setup for another learning experiment.

I was able to reproduce this error when I made the same changes to the same ctrl: endpoint: in one of these public edge routers.

ctrl:
  endpoint:             tls:controller-ip-address:6262

However, if I use the following entry it works, because that host name is linked to 127.0.0.1

ctrl:
  endpoint:             tls:instance-20220416-1603:6262

Now… there is a bit of history here… as I am using an Oracle Linux machine… meaning that I needed to setup the Quickstart using IP addresses… rather than a fully qualified domain name.

When I did this… I needed to modify the CSR of the public edge router to use the IP address… and re-create the edge router identity for it all to work… unfortunately… I cannot find the comments @TheLumberjack made about this…

So… I am now thinking… that the certificate used by the controller does not have the IP address in the SAN…

To solve this… I have a dreadful feeling that I will need to rebuild the controller…

Does this make sense… have I correctly identified the cause of my problem?

To resolve… is there a simple way to regenerate the controller certificate?

@TheLumberjack … have you experienced this before?

I just confirmed this using the following command

openssl x509 -noout -text -in controller-ip-address

So.. any tips on how to fix this would be very helpful.. especially in how to modify the controller CSR to include the IP address in the SAN..

I know that this can be configured for edge routers.. but not sure about a controller
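
The x509 error refers to the names inside the certificate that the controller presents on its ctrl listener, so one way to narrow this down is to read the SAN list of that certificate directly. The following is a minimal sketch, assuming `openssl` and `grep` are installed; the helper name `san_has_ip` and the certificate path in the comments are hypothetical. It reads the text form of a certificate on stdin and reports whether a given IP appears as a SAN entry.

```shell
# Hypothetical helper: read `openssl x509 -noout -text` output on stdin and
# report whether the given IP is listed as an "IP Address:" SAN entry.
san_has_ip() {
  # -F: fixed string, -w: whole word, so 10.0.0.5 does not match 10.0.0.50.
  if grep -Fwq "IP Address:$1"; then
    echo "SAN covers $1"
  else
    echo "SAN does NOT cover $1"
    return 1
  fi
}

# Check a certificate file on disk (path is a placeholder):
#   openssl x509 -noout -text -in path/to/server.cert | san_has_ip controller-ip-address
#
# Check the certificate actually served on the ctrl port:
#   openssl s_client -connect controller-ip-address:6262 </dev/null 2>/dev/null \
#     | openssl x509 -noout -text | san_has_ip controller-ip-address
```

Checking the served certificate rather than a file avoids inspecting a certificate that the listener is not actually using.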

To investigate further… I peered into the ziti-cli-functions.sh file… as maybe I could tweak the QuickStart script to get it to do what I want.

For instance… I found this code that creates the server cert from the CA


if ! test -f "${ZITI_PKI}/${ZITI_CA_NAME_local}/keys/${name_local}-server.key"; then
  echo "Creating server cert from ca: ${ZITI_CA_NAME_local} for ${name_local}"
  "${ZITI_BIN_DIR-}/ziti" pki create server --pki-root="${ZITI_PKI_OS_SPECIFIC}" --ca-name "${ZITI_CA_NAME_local}" \
        --server-file "${name_local}-server" \
        --dns "${name_local},localhost" --ip "${ip_local}" \
        --server-name "${name_local} server certificate"
else
  echo "Creating server cert from ca: ${ZITI_CA_NAME_local} for ${name_local}"
  echo "key exists"
fi

 

So… then… to get what I want… maybe I need to add these extra bits to the procedure that creates the CA… which is missing the SAN details

function pki_create_ca {
  cert=$1

  echo "Creating CA: ${cert}"
  if ! test -f "${ZITI_PKI}/${cert}/keys/${cert}.key"; then
    "${ZITI_BIN_DIR}/ziti" pki create ca --pki-root="${ZITI_PKI_OS_SPECIFIC}" --ca-file="${cert}" --ca-name="${cert} Root CA"
  else
    echo "key exists"
  fi
  echo " "
}

However… I am not really sure if this is appropriate given that its the CA…

Which then raises a whole bunch of more questions if that is the case.

I look forward to your further comments and suggestions

I found the link..

Actually.. I am thinking about this differently now..

Does the config of the QuickStart let you have the edge router run on a separate compute from the controller?

I have definitely worked out how to have two public edge routers on the same compute as the controller... but to have the controller and public edge router run on separate machines appears to require some special certificate configurations.

When I look at the docker compose example.. there is only one public edge router.. which is on the same host as the controller

https://openziti.github.io/ziti/quickstarts/network/local-docker-compose.html

I guess when you consider a public edge router is software.. it does not really matter where it is located.. as long as your server is big enough.

Is the reason the SAN is not added to the certificate authority that it's simply not something you do? If that is the case, then it appears that it is not possible to have the controller and the edge router running on separate computes.

The motivation to pursue this is that I wanted to see if I could get a bit more performance out of a micro compute by separating the controller and edge router... as the machine I am using is a micro server.

Looking forward to your further insights.

Markamind, root CAs do not have SANs. You only need to create a new server certificate/key pair with

"${ZITI_BIN_DIR-}/ziti" pki create server --pki-root="${ZITI_PKI_OS_SPECIFIC}" --ca-name "${ZITI_CA_NAME_local}" \
      --server-file "${name_local}-server" \
      --dns "${name_local},localhost" --ip "${ip_local}" \
      --server-name "${name_local} server certificate"

Just add an additional --ip entry. If you want to use the same certificate names as you currently have configured in your controller yaml, you will first need to delete the corresponding ${name_local}-server.cert and ${name_local}-server.key from under your ziti pki root, or you will get an error saying they already exist. If not, you will need to edit the yaml with the new cert names used. In either case, after creating the new cert/key pair you will need to restart the controller for the new certs to be used by the controller.
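
The steps described above can be sketched as a dry run that only prints the commands, so they can be reviewed before anything under the pki root is touched. This is a sketch under assumptions: the helper name `print_reissue_commands` is hypothetical, the `certs/` and `keys/` layout is taken from the paths shown earlier in this thread, and moving the old pair aside (rather than deleting it) is a swapped-in precaution, not part of the advice above. The `ziti pki create server` flags are the ones quoted from ziti-cli-functions.sh, with `--ip` repeated once per address.

```shell
# Sketch only: prints the commands instead of running them.
# Arguments: <pki-root> <ca-name> <server-name> <ip> [<ip> ...]
print_reissue_commands() {
  local pki_root="$1" ca_name="$2" name="$3"
  shift 3
  local ip ip_flags=""
  # One --ip flag per address that should end up in the SAN list.
  for ip in "$@"; do
    ip_flags="${ip_flags} --ip \"${ip}\""
  done
  # Move the existing pair aside so the same file names can be reused.
  printf '%s\n' "mv \"${pki_root}/${ca_name}/certs/${name}-server.cert\" \"${pki_root}/${ca_name}/certs/${name}-server.cert.bak\""
  printf '%s\n' "mv \"${pki_root}/${ca_name}/keys/${name}-server.key\" \"${pki_root}/${ca_name}/keys/${name}-server.key.bak\""
  # Re-create the server cert/key pair with the extended IP list.
  printf '%s\n' "ziti pki create server --pki-root=\"${pki_root}\" --ca-name \"${ca_name}\" --server-file \"${name}-server\" --dns \"${name},localhost\"${ip_flags} --server-name \"${name} server certificate\""
}

# Example (all values are placeholders):
#   print_reissue_commands "${ZITI_PKI_OS_SPECIFIC}" "${ZITI_CA_NAME_local}" "${name_local}" 127.0.0.1 controller-ip-address
```

After running the printed commands by hand, the controller still has to be restarted before the new certificate is served.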


Awesome… I really appreciate your support in helping me through this… I find it quite daunting at times as I know so little of this space… but piece by piece… the masterpiece is taking shape.

I will report back to let you know how I get on :slight_smile:

After doing a bit more digging… I realised that I am going around in circles… and am now back at the following error message

“error connecting ctrl (x509: certificate is valid for 127.0.0.1, not controller-ip-address)”

To reproduce this error… all I need to do is to change the following entry for an edge router yaml config file that is currently working and configured to run on the same server as a controller

original

ctrl:
  endpoint:             tls:instance-20220416-1603:6262

modified

ctrl:
  endpoint:             tls:controller-ip-address:6262

What I don’t understand is how to interpret this error message.

I checked the SAN of the router certificate… which does include both 127.0.0.1 and the controller-ip-address.

If this is the case, then what could be causing this error?

Here is a complete log of the edge router

[   0.137]    INFO ziti/ziti-router/subcmd.run: {routerId=[8Zo-gTL0Ck] go-version=[go1.18.4] os=[linux] arch=[amd64] build-date=[2022-07-19T20:09:36Z] configFile=[/home/opc/.ziti/quickstart/instance-20220416-1603/instance-20220416-1603-edge-router.yaml] revision=[f4124e248129] version=[v0.26.3]} starting ziti-router
[   0.137] WARNING edge/router/internal/edgerouter.(*Config).LoadConfigFromMap: Invalid heartbeat interval [0] (min: 60, max: 10), setting to default [60]
[   0.137]    INFO fabric/router/forwarder.(*Faulter).run: started
[   0.137]    INFO fabric/router/forwarder.(*Scanner).run: started
[   0.138]    INFO fabric/router.(*Router).showOptions: ctrl = {"OutQueueSize":4,"MaxQueuedConnects":1,"MaxOutstandingConnects":16,"ConnectTimeout":1000000000,"DelayRxStart":false,"WriteTimeout":0}
[   0.139]    INFO fabric/router.(*Router).showOptions: metrics = {"ReportInterval":60000000000,"MessageQueueSize":10}
[   0.139]    INFO fabric/router.(*Router).initializeHealthChecks: starting health check with ctrl ping initially after 15s, then every 30s, timing out after 15s
[   0.139]    INFO edge/router/xgress_edge.(*listener).Listen: {address=[tls:0.0.0.0:8442]} starting channel listener
[   0.140]    INFO fabric/metrics.GoroutinesPoolMetricsConfigF.func1.1: {poolType=[pool.listener.xgress_edge] minWorkers=[1] maxWorkers=[16] idleTime=[10s] maxQueueSize=[1]} starting goroutine pool
[   0.140]    INFO fabric/router.(*Router).startXgressListeners: created xgress listener [edge] at [tls:0.0.0.0:8442]
[   0.140]    INFO fabric/router.(*Router).startXgressListeners: created xgress listener [tunnel] at []
[   0.140]    INFO edge/router/xgress_edge.(*Acceptor).Run: starting
[   0.188]   FATAL ziti/ziti-router/subcmd.run: {error=[error connecting ctrl (x509: certificate is valid for 127.0.0.1, not 168.138.10.79)]} error starting
