Hey John,
Thanks again for the logs and captures. I had some time to look at the latest yesterday. I see evidence of connections being stuck in FIN_WAIT_1 (on the gateway side). I was hoping that keepalives would help with this, but for some reason I see no sign of keepalive activity in your logs. For example, I see lots of this when I run the same build that you ran:
tcp_keepalive: sending KEEPALIVE probe to 100.64.10.1
tcp_keepalive: tcp_ticks 34282 pcb->tmr 34221 pcb->keep_cnt_sent 0
tcp_keepalive: seqno 4033391 ackno 2627159563 err 0.
But I don’t see it at all in your logs. I’ll dive into the lwip logic that leads to keepalives being sent today.
I also saw something that could be packet loss at/in ziti-edge-tunnel
’s packet reader. Here’s an example where the ziti connection closes first:
[ 249.523] DEBUG tunnel-cbs:ziti_tunnel_cbs.c:126 on_ziti_data() ziti connection sent EOF (ziti_eof=0, tnlr_eof=0)
[ 249.523] DEBUG tunnel-sdk:ziti_tunnel.c:466 ziti_tunneler_close_write() closing write connection: client[tcp:#.#.179.#:58804] service[ZITI_SERVICE]
[ 249.523] DEBUG tunnel-sdk:tunnel_tcp.c:243 tunneler_tcp_close_write() closing write tcp local=#.#.171.#:8301 remote=#.#.179.#:58804 state=4(ESTABLISHED) flags=0x100 service=ZITI_SERVICE
[ 249.523] DEBUG tunnel-sdk:tunnel_tcp.c:255 tunneler_tcp_close_write() closed write tcp local=#.#.171.#:8301 remote=#.#.179.#:58804 state=5(FIN_WAIT_1) flags=0x120 service=ZITI_SERVICE
This sequence of log messages basically means that ziti-edge-tunnel
sent FIN
to the client and is waiting for an ACK
. There is no sign in the logs of that ACK, but the capture shows it (frame 2378):
No. Time. Source Destination Proto Len Info
2285 236.448286 #.#.179.# #.#.171.# TCP 60 58804 → 8301 [SYN] Seq=845976624 Win=62727 Len=0 MSS=8460 SACK_PERM TSval=1766677971 TSecr=0 WS=128
2286 236.479739 #.#.171.# #.#.179.# TCP 48 8301 → 58804 [SYN, ACK] Seq=13233 Ack=845976625 Win=65535 Len=0 MSS=32768 WS=16384
2287 236.480851 #.#.179.# #.#.171.# TCP 40 58804 → 8301 [ACK] Seq=845976625 Ack=13234 Win=62848 Len=0
2288 236.481697 #.#.179.# #.#.171.# TCP 743 [TCP Previous segment not captured] 58804 → 8301 [PSH, ACK] Seq=845985586 Ack=13234 Win=62848 Len=703
2289 236.481712 #.#.171.# #.#.179.# TCP 40 [TCP Window Update] 8301 → 58804 [ACK] Seq=13234 Ack=845976625 Win=49152 Len=0
2376 246.479420 #.#.171.# #.#.179.# TCP 40 8301 → 58804 [FIN, ACK] Seq=13234 Ack=845976625 Win=49152 Len=0
2377 246.480506 #.#.179.# #.#.171.# TCP 40 58804 → 8301 [FIN, ACK] Seq=845986289 Ack=13235 Win=62848 Len=0
2378 246.480550 #.#.171.# #.#.179.# TCP 40 [TCP Dup ACK 2286#1] 8301 → 58804 [ACK] Seq=13235 Ack=845976625 Win=49152 Len=0
So tcpdump apparently sees the FIN that should advance the lwip connection out of FIN_WAIT_1.
Unfortunately I didn’t enable the logs that would show the packets that ziti-edge-tunnel
receives in this build. Nor did I enable the tcp state machine logs. Both of these are incredibly voluminous, but that may end up being necessary to understand whether the packet simply isn’t being read or if it’s being ignored by lwip for some reason. I’m still exploring options, but wanted to give you a quick update.