Hi Paul,
I've recently been running tests (VU=200, each VU issuing 1000 sequential requests after authenticating) against a deployment where the router & controller are separated onto different PCs.
And I found that under heavy load, the router seems to get killed by something. This didn't happen when the router & controller were deployed on the same PC and Docker network (from the quickstart guide).
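Roughly, the load pattern looks like the following (a simplified sketch, not my actual harness; the service URL and the auth step are placeholders):

package main

import (
    "net/http"
    "sync"
    "time"
)

func main() {
    // Shared client; Timeout is the Client.Timeout knob mentioned with the
    // result tables below (30s vs 180s).
    client := &http.Client{Timeout: 30 * time.Second}

    var wg sync.WaitGroup
    for vu := 0; vu < 200; vu++ { // 200 virtual users
        wg.Add(1)
        go func() {
            defer wg.Done()
            // authenticate once per VU here (details elided)
            for i := 0; i < 1000; i++ { // 1000 sequential requests per VU
                resp, err := client.Get("http://test-service.example/") // placeholder URL
                if err != nil {
                    continue // counted as a failed check
                }
                resp.Body.Close()
            }
        }()
    }
    wg.Wait()
}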
Below are the last log lines from the router before it was killed:
...
...
[4082.197] WARNING ziti/router/forwarder.(*Scanner).scan: {ctrlId=[ziti-controller] circuitCount=[1044]} sent confirmation for circuits
[4159.134] ERROR channel/v2/latency.(*LatencyHandler).HandleReceive.func1 [ch{edge}->u{classic}->i{7kMr}]: {error=[channel closed]} error sending latency response
[4159.428] ERROR channel/v2/latency.(*LatencyHandler).HandleReceive.func1 [ch{edge}->u{classic}->i{erNG}]: {error=[channel closed]} error sending latency response
[4159.792] ERROR channel/v2/latency.(*LatencyHandler).HandleReceive.func1 [ch{edge}->u{classic}->i{rzlj}]: {error=[channel closed]} error sending latency response
[4160.076] ERROR channel/v2/latency.(*LatencyHandler).HandleReceive.func1 [ch{edge}->u{classic}->i{Gkmy}]: {error=[channel closed]} error sending latency response
[4160.269] ERROR channel/v2.(*heartbeater).sendHeartbeat: {error=[context deadline exceeded] channelId=[ch{ctrl}->u{reconnecting}->i{ZRXa}]} pulse failed to send heartbeat
[4160.598] ERROR channel/v2/latency.(*LatencyHandler).HandleReceive.func1 [ch{edge}->u{classic}->i{l5gN}]: {error=[channel closed]} error sending latency response
[4160.625] ERROR channel/v2.(*reconnectingImpl).Tx [u{reconnecting}->i{ZRXa}]: tx error (write tcp 172.20.0.2:42762->10.17.64.119:6262: write: broken pipe). starting reconnection process
[4160.633] INFO channel/v2.(*reconnectingImpl).pingInstance [u{reconnecting}->i{ZRXa}]: starting
[4160.636] INFO channel/v2.(*reconnectingImpl).pingInstance [u{reconnecting}->i{ZRXa}]: exiting
[4160.644] ERROR channel/v2/latency.(*LatencyHandler).HandleReceive.func1 [ch{edge}->u{classic}->i{nkVm}]: {error=[channel closed]} error sending latency response
[4160.656] ERROR ziti/router/xgress_edge.(*edgeClientConn).processConnect [ch{edge}->u{classic}->i{4ZyN}]: {token=[a0f414c1-7b71-4ffd-91eb-6614f38c985c] connId=[4] type=[EdgeConnectType] error=[mux is closed, can't add sink with id [4]] chSeq=[520] edgeSeq=[0]} error adding to msg mux
[4160.681] ERROR channel/v2.(*reconnectingImpl).Rx [u{reconnecting}->i{ZRXa}]: {error=[tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.20.0.2:42762->10.17.64.119:6262: write: broken pipe]} error closing peer after rx error
[4160.684] ERROR channel/v2.(*reconnectingImpl).Rx [u{reconnecting}->i{ZRXa}]: {error=[EOF]} rx error. closed peer and starting reconnection process
[4160.791] WARNING ziti/router/forwarder.(*Scanner).scan: {ctrlId=[ziti-controller] idleTime=[2m18.16s] idleThreshold=[1m0s] circuitId=[lfMZ3-rPBy]} circuit exceeds idle threshold
[4160.825] WARNING ziti/router/forwarder.(*Scanner).scan: {circuitId=[Fi643-rPBX] ctrlId=[ziti-controller] idleTime=[1m16.122s] idleThreshold=[1m0s]} circuit exceeds idle threshold
[4160.826] WARNING ziti/router/forwarder.(*Scanner).scan: {circuitId=[OQf13P6PB] ctrlId=[ziti-controller] idleTime=[6m13.638s] idleThreshold=[1m0s]} circuit exceeds idle threshold
[4160.826] WARNING ziti/router/forwarder.(*Scanner).scan: {circuitId=[yfRZ3-rPgU] ctrlId=[ziti-controller] idleTime=[2m18.147s] idleThreshold=[1m0s]} circuit exceeds idle threshold
[4160.826] WARNING ziti/router/forwarder.(*Scanner).scan: {idleThreshold=[1m0s] circuitId=[y3MZ3-6PBI] ctrlId=[ziti-controller] idleTime=[2m18.151s]} circuit exceeds idle threshold
[4160.826] WARNING ziti/router/forwarder.(*Scanner).scan: {idleThreshold=[1m0s] ctrlId=[ziti-controller] circuitId=[1E113P6-gX] idleTime=[6m19.241s]} circuit exceeds idle threshold
[4160.826] WARNING ziti/router/forwarder.(*Scanner).scan: {circuitId=[XfRm3-r-Bx] ctrlId=[ziti-controller] idleTime=[2m18.185s] idleThreshold=[1m0s]} circuit exceeds idle threshold
[4160.827] WARNING ziti/router/forwarder.(*Scanner).scan: {ctrlId=[ziti-controller] circuitId=[rsLffPr-gW] idleTime=[3m19.208s] idleThreshold=[1m0s]} circuit exceeds idle threshold
[4160.830] WARNING ziti/router/forwarder.(*Scanner).scan: {idleThreshold=[1m0s] circuitId=[pfMmfP6PBt] ctrlId=[ziti-controller] idleTime=[2m18.091s]} circuit exceeds idle threshold
[4160.830] WARNING ziti/router/forwarder.(*Scanner).scan: {ctrlId=[ziti-controller] idleThreshold=[1m0s] idleTime=[1m16.109s] circuitId=[JI643-6PB]} circuit exceeds idle threshold
[4160.830] WARNING ziti/router/forwarder.(*Scanner).scan: {ctrlId=[ziti-controller] idleTime=[3m19.243s] idleThreshold=[1m0s] circuitId=[PPWffP6Pgy]} circuit exceeds idle threshold
[4160.830] WARNING ziti/router/forwarder.(*Scanner).scan: {circuitId=[1rt63-r-g] ctrlId=[ziti-controller] idleTime=[3m36.191s] idleThreshold=[1m0s]} circuit exceeds idle threshold
[4160.830] WARNING ziti/router/forwarder.(*Scanner).scan: {idleTime=[1m16.17s] idleThreshold=[1m0s] circuitId=[BFr4f-6Pgs] ctrlId=[ziti-controller]} circuit exceeds idle threshold
[4160.831] WARNING ziti/router/forwarder.(*Scanner).scan: {idleTime=[6m13.189s] idleThreshold=[1m0s] circuitId=[y6f13-r-B] ctrlId=[ziti-controller]} circuit exceeds idle threshold
[4160.831] WARNING ziti/router/forwarder.(*Scanner).scan: {ctrlId=[ziti-controller] idleTime=[1m16.15s] idleThreshold=[1m0s] circuitId=[2Q643P6PBX]} circuit exceeds idle threshold
[4160.831] WARNING ziti/router/forwarder.(*Scanner).scan: {ctrlId=[ziti-controller] idleTime=[6m19.299s] idleThreshold=[1m0s] circuitId=[kaTT3PrPBw]} circuit exceeds idle threshold
[4160.831] WARNING ziti/router/forwarder.(*Scanner).scan: {idleTime=[6m12.823s] idleThreshold=[1m0s] circuitId=[khZ1fP6-g] ctrlId=[ziti-controller]} circuit exceeds idle threshold
[4160.831] WARNING ziti/router/forwarder.(*Scanner).scan: {idleTime=[3m52.693s] idleThreshold=[1m0s] circuitId=[xqW63Pr-B] ctrlId=[ziti-controller]} circuit exceeds idle threshold
[4160.831] WARNING ziti/router/forwarder.(*Scanner).scan: {idleTime=[3m34.498s] circuitId=[pFc6f-6Pg] ctrlId=[ziti-controller] idleThreshold=[1m0s]} circuit exceeds idle threshold
[4160.831] WARNING ziti/router/forwarder.(*Scanner).scan: {ctrlId=[ziti-controller] idleTime=[3m44.93s] circuitId=[K-Nrf-6-B] idleThreshold=[1m0s]} circuit exceeds idle threshold
[4160.831] WARNING ziti/router/forwarder.(*Scanner).scan: {ctrlId=[ziti-controller] idleTime=[1m16.117s] circuitId=[Ti643P6-gl] idleThreshold=[1m0s]} circuit exceeds idle threshold
[4160.831] WARNING ziti/router/forwarder.(*Scanner).scan: {idleTime=[6m19.347s] idleThreshold=[1m0s] circuitId=[05113-rPBy] ctrlId=[ziti-controller]} circuit exceeds idle threshold
[4160.831] WARNING ziti/router/forwarder.(*Scanner).scan: {circuitId=[CErefPr-BX] ctrlId=[ziti-controller] idleTime=[1m16.14s] idleThreshold=[1m0s]} circuit exceeds idle threshold
[4160.937] ERROR ziti/router/xgress_edge.(*edgeClientConn).sendStateClosedReply: {edgeSeq=[0] error=[channel closed] connId=[4] type=[EdgeStateClosedType] chSeq=[514]} failed to send state response
/var/openziti/scripts/run-router.sh: line 75: 15 Killed "${ZITI_BIN_DIR}/ziti" router run "${ZITI_HOME}/${ZITI_ROUTER_NAME}.yaml"
I noticed this log line:
[4160.269] ERROR channel/v2.(*heartbeater).sendHeartbeat: {error=[context deadline exceeded] channelId=[ch{ctrl}->u{reconnecting}->i{ZRXa}]} pulse failed to send heartbeat
Is it closely related to the self-termination?
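Also, the bare "Killed" in the script output makes me suspect the kernel OOM killer rather than a crash, especially since the router's memory usage climbs round over round in my results below. To confirm, I plan to check (assuming the compose service name ziti-edge-router; the actual container name may differ):

# Kernel log shows the victim process if the host OOM killer fired
sudo dmesg -T | grep -iE 'out of memory|oom-kill|killed process'

# Container-level check (may stay false here, since the wrapper script
# survived as PID 1 and the container never exited)
docker inspect -f '{{.State.OOMKilled}}' ziti-edge-router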
Here is the docker-compose.yml for my router:
services:
  ziti-edge-router:
    user: "1000"
    image: "${ZITI_IMAGE}:${ZITI_VERSION}"
    env_file:
      - ./.env
    ports:
      - ${ZITI_INTERFACE:-0.0.0.0}:${ZITI_ROUTER_PORT:-3022}:${ZITI_ROUTER_PORT:-3022}
      - ${ZITI_INTERFACE:-0.0.0.0}:${ZITI_ROUTER_WSS_PORT:-3023}:${ZITI_ROUTER_WSS_PORT:-3023}
      - ${ZITI_INTERFACE:-0.0.0.0}:${ZITI_ROUTER_LISTENER_BIND_PORT:-10080}:${ZITI_ROUTER_LISTENER_BIND_PORT:-10080}
    environment:
      - ZITI_CTRL_ADVERTISED_ADDRESS=${ZITI_CTRL_ADVERTISED_ADDRESS:-ziti-controller}
      - ZITI_CTRL_ADVERTISED_PORT=${ZITI_CTRL_ADVERTISED_PORT:-6262}
      - ZITI_CTRL_EDGE_ADVERTISED_ADDRESS=${ZITI_CTRL_EDGE_ADVERTISED_ADDRESS:-ziti-edge-controller}
      - ZITI_CTRL_EDGE_ADVERTISED_PORT=${ZITI_CTRL_EDGE_ADVERTISED_PORT:-1280}
      - ZITI_ROUTER_NAME=${ZITI_ROUTER_NAME:-ziti-edge-router}
      - ZITI_ROUTER_ADVERTISED_ADDRESS=${ZITI_ROUTER_ADVERTISED_ADDRESS:-ziti-edge-router}
      - ZITI_ROUTER_PORT=${ZITI_ROUTER_PORT:-3022}
      - ZITI_ROUTER_LISTENER_BIND_PORT=${ZITI_ROUTER_LISTENER_BIND_PORT:-10080}
      - ZITI_ROUTER_ROLES=public
      - ZITI_ROUTER_IP_OVERRIDE=${EXTERNAL_IP}
      - NODE_EXTRA_CA_CERTS=${NODE_EXTRA_CA_CERTS:-}
      - ZITI_PKI_ALT_SERVER_CERT=${ZITI_PKI_ALT_SERVER_CERT:-}
      - ZITI_PKI_ALT_SERVER_KEY=${ZITI_PKI_ALT_SERVER_KEY:-}
    networks:
      - ziti
    volumes:
      - ./ziti-fs/:/persistent/
      - ./dev_certs/:/etc/certs/
      - ./ziti-bins/:/openziti/ziti-bin
    entrypoint: /bin/bash
    command: "/var/openziti/scripts/run-router.sh edge"
networks:
  ziti:
#volumes:
#  ziti-fs:
I'm wondering if there is anything I can do to prevent this situation. Is it possible to configure the heartbeat timeout?
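In the meantime, one mitigation I'm considering is putting an explicit memory cap on the router container, so that any OOM kill is at least contained and easy to spot (a sketch with a hypothetical value; the limit would need tuning):

services:
  ziti-edge-router:
    # hypothetical cap: an OOM then happens inside this container's cgroup
    # instead of the host OOM killer picking a victim
    mem_limit: 4g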
I also saw these logs from the controller:
[89094.433] ERROR ziti/controller/handler_ctrl.(*heartbeatCallback).CheckHeartBeat: {channelType=[router] channelId=[.5eXAZqVrh]} heartbeat not received in time, closing control channel connection
[89094.434] INFO ziti/controller/handler_ctrl.(*xctrlCloseHandler).HandleClose [ch{.5eXAZqVrh}->u{classic}->i{aVNa}]: closing Xctrl instances
[89094.434] WARNING ziti/controller/handler_ctrl.(*closeHandler).HandleClose: {routerId=[.5eXAZqVrh]} disconnected
[89094.434] INFO ziti/controller/network.(*routerChangedEvent).handle: {routerId=[.5eXAZqVrh] connected=[false]} calculating router updates for router
[89094.434] INFO ziti/controller/env.(*Broker).RouterDisconnected.func1: {routerName=[router2] routerFingerprint=[0xc0065c6d60] routerId=[.5eXAZqVrh]} broker detected router with id .5eXAZqVrh disconnecting
[89094.434] INFO ziti/controller/sync_strats.(*InstantStrategy).RouterDisconnected: {sync_strategy=[instant] routerId=[.5eXAZqVrh] routerName=[router2] routerFingerprint=[121b433e8af8b9f8c4e6f6a358ddb8ca85c3325f]} edge router [.5eXAZqVrh] disconnect event, router rtx removed
Thanks!
Regarding the performance results of separating the controller & router, I found an interesting symptom.
If I restart the router before each test round, I easily get much higher RPS (almost ten times) than without restarting it.
Here are my result CSVs:
- without restarting the router before each round (20s sleep between rounds, Client.Timeout set to 30s)
subseqreqs,round,vus_max,success_rate,checks_pass,checks_fails,http_reqs,recv_avg,recv_med,recv_p90,router_cpu_usage%,router_mem_usage%,controller_cpu_usage%,controller_mem_usage%
1000,1,200,2912.1833333333334,524193,1107,525300,0.01997380120883363,0.017289,0.03476910000000004,81.73888888888743,28.278586175669517,23.633333333334832,3.285104235341707
1000,2,200,1264.388888888889,227590,1085,228675,0.023423838854268495,0.019292,0.033446,83.55555555555435,70.79598548208621,22.988888888888454,3.2806255281422336
1000,3,200,20631.722222222223,3713710,0,3713710,0.012002000709532605,0.007832,0.025244,85.28300628903382,6.009228731519823,27.26388888888728,4.338284759014388
1000,4,200,4269.322222222222,768478,986,769464,0.01907823072424435,0.016734,0.034548,81.69444444443798,21.68988339219917,3.3694444444457474,3.331525760509535
1000,5,200,1625.1833333333334,292533,1360,293893,0.022204870139812875,0.020581,0.034079,83.00517789285209,61.92495597068538,2.517278161730772,3.265047814629185
1000,6,200,1577,283860,1310,285170,0.023525204597258337,0.021283,0.035495,82.70000000000033,95.1559441852859,2.568831803731058,3.285443346186956
1000,9,200,5146.155555555555,926308,921,927229,0.017954461989432867,0.013958,0.03332320000000007,80.92820159552721,16.803491461231424,3.2805555555508903,3.371754693935658
1000,7,200,12357.33888888889,2224321,345,2224667,0.014231163109804488,0.009247,0.029168,83.95519900441974,7.335058466447453,26.263888888893746,4.372681596912686
1000,8,200,3518.822222222222,633388,1027,634415,0.02071548012263244,0.019113,0.035976,81.5444444444438,21.011809781539238,3.9472222222280635,3.394074908669198
1000,10,200,1625.3055555555557,292555,1360,293916,0.02244683817825517,0.0207485,0.034423,82.5781649296679,52.08347135465474,2.5533900753301424,3.3005444714850563
- without restarting, Client.Timeout raised to 180s (this helps reduce request errors & router self-termination)
subseqreqs,round,vus_max,success_rate,checks_pass,checks_fails,http_reqs,recv_avg,recv_med,recv_p90,router_cpu_usage%,router_mem_usage%,controller_cpu_usage%,controller_mem_usage%
1000,11,200,17521.03888888889,3153787,0,3153787,0.012693815892131963,0.007965,0.026576,89.144444444448,8.558243437840407,27.305555555548295,4.343673871906528
1000,12,200,3297.461111111111,593543,0,593543,0.018924504687950396,0.015005,0.03450280000000005,80.06622369385727,22.1718846319898,2.655555555563012,3.3605548708843136
1000,13,200,472.2888888888889,85012,0,85012,0.02125048080271018,0.019427,0.03313790000000001,77.44444444445739,35.85280429360627,2.258333333334633,3.3095049409366695
1000,14,200,372.22222222222223,67000,0,67000,0.019862711238806043,0.018182,0.031417,72.47777777777326,51.27760744544268,2.2833333333296224,3.3083776265051434
- with restarting the router before each round (Client.Timeout set to 180s)
subseqreqs,round,vus_max,success_rate,checks_pass,checks_fails,http_reqs,recv_avg,recv_med,recv_p90,router_cpu_usage%,router_mem_usage%,controller_cpu_usage%,controller_mem_usage%
1000,15,200,14860.594444444445,2674907,0,2674907,0.011936543716473885,0.007559,0.025392,81.7448501144419,9.398568551629493,26.1321970622854,4.391204381189917
1000,16,200,16427.62777777778,2956973,0,2956973,0.011946431934278499,0.007427,0.025549,87.1888888888889,9.348776205477193,27.551167803729598,4.410933911267834
1000,17,200,5425.9,976662,0,976662,0.015532077946106108,0.009869,0.030533,81.70514900331122,17.54594355233756,3.7916666666610155,3.4215642725033635
1000,18,200,20106.977777777778,3619256,0,3619256,0.011362545774048133,0.007388,0.024219,80.6277777777776,9.086195002762064,26.290526877184377,4.386090223037176
1000,19,200,9662.822222222223,1739308,0,1739308,0.013278769226037832,0.008183,0.027637,80.59401320029318,10.288889195374896,21.75381675148128,4.131818190150138
1000,20,200,16734.26111111111,3012167,0,3012167,0.011419567380893336,0.007247,0.024619,85.05522344940968,8.863530181907386,27.232172618381895,4.389075009485943
And whenever I got higher RPS, the controller's CPU usage was higher, too.
So I started to wonder: for the rounds with lower RPS and lower controller CPU usage, does that mean the controller was waiting on something? Could it be the controller's DB writes? But then how would restarting the router help with that?
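One hypothesis: idle circuits accumulate on the router between rounds (the scanner logged circuitCount=[1044] and many idle-threshold warnings above, and router memory grows round over round without a restart), and restarting the router clears them. To test this I could compare the circuit count before and after each round, e.g. (assuming the quickstart's ziti CLI is logged in; paging flags may differ by version):

# count live circuits on the fabric
ziti fabric list circuits | wc -l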
Best Regards,
Rick Wang