One more data point: I also ran 2 iso downloads concurrently along with the google curls, and that worked fine as well.
$ curl -4 --proxy socks5://127.0.0.1:4141 -o /dev/null https://mirror.in2p3.fr/pub/fedora/linux/releases/42/Workstation/x86_64/iso/Fedora-Workstation-Live-42-1.1.x86_64.iso
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 2287M 0 5680k 0 0 850k 0 0:45:53 0:00:06 0:45:47 848k^C
$ curl -4 -I --proxy socks5://127.0.0.1:4141 https://www.google.com:443
curl: (97) Recv failure: Connection reset by peer
The second scenario includes only the curl command followed by Ctrl-C
$curl -4 --proxy socks5://127.0.0.1:4141 -o /dev/null https://mirror.in2p3.fr/pub/fedora/linux/releases/42/Workstation/x86_64/iso/Fedora-Workstation-Live-42-1.1.x86_64.iso
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
1 2287M 1 41.6M 0 0 1588k 0 0:24:34 0:00:26 0:24:08 1675k^C
zrok share private -b socks --headless
[ 1.223] INFO sdk-golang/ziti.(*listenerManager).createSessionWithBackoff: {session token=[cbbe1384-67c6-45d8-a80f-4b754447c592]} new service session
[ 1.289] INFO main.(*sharePrivateCommand).shareLocal: allow other to access your share with the following command:
zrok access private f9js0ymsxlh6
[ 59.997] INFO main.(*sharePrivateCommand).shareLocal: -> CONNECT 172.217.20.196:443
[ 83.902] INFO main.(*sharePrivateCommand).shareLocal: -> CONNECT 134.158.69.171:443
[ 136.022] ERROR channel/v4/latency.ProbeLatencyConfigurable [ch{ziti-sdk[router=tls:router.domain.name:443]}->u{classic}->i{wl1mwBnP8p/yXwL}]: {error=[timeout waiting for message reply: context deadline exceeded]} unexpected error sending latency probe
[ 136.023] ERROR sdk-golang/ziti.(*ContextImpl).connectEdgeRouter.func1.3: latency timeout after [10s]
[ 136.023] ERROR sdk-golang/ziti.(*ContextImpl).connectEdgeRouter.func1.3: no read traffic on channel since before latency probe was sent, closing channel
[ 136.023] INFO zrok/endpoints/socks.(*Server).Serve.func1: client connection failed: from backend to client: writeto tcp XXX.XXX.XXX.XXX:54526->134.158.69.171:443: channel closed
[ 136.023] ERROR sdk-golang/ziti/edge/network.(*edgeConn).close: {marker=[0C74noVk] circuitId=[6kINEvitWsCk7WJmIltiHT] error=[channel closed] connId=[2147483650]} failed to send close message
[ 136.023] ERROR sdk-golang/ziti/edge/network.(*edgeListener).close: {connId=[1] sessionId=[cbbe1384-67c6-45d8-a80f-4b754447c592] error=[channel closed]} unable to unbind session for conn
[ 136.023] ERROR sdk-golang/ziti/edge/network.(*edgeHostConn).close: {serviceName=[f9js0ymsxlh6] connId=[1] marker=[] error=[channel closed]} failed to close listener
[ 136.024] ERROR channel/v4.(*channelImpl).tx [ch{ziti-sdk[router=tls:router.domain.name:443]}->u{classic}->i{wl1mwBnP8p/yXwL}]: {error=[write tcp XXX.XXX.XXX.XXX:34040->XXX.XXX.XXX.XXX:443: use of closed network connection]} write error
[1986.709] WARNING sdk-golang/ziti.(*ContextImpl).createSession: {error=[error for request ZbGVys1BL: NOT_FOUND: The resource requested was not found or is no longer available, caused by: error for request : UNHANDLED: service with id 3fS6OghcRdLq52xPGsY3Ks not found] errorType=[*rest_util.APIFormattedError]} failure creating Bind session to service f9js0ymsxlh6
goroutine 396 gp=0xc0017cce00 m=nil [IO wait]
This goroutine is blocked because the TCP/TLS socket’s send buffer is full and the peer isn’t reading?
$netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp6 0 558453 XXX.XXX.XXX.XXX:443 zrok.access:32888 ESTABLISHED
tcp 5683880 0 XXX.XXX.XXX.XXX:54526 mirror.in2p3.fr:443 ESTABLISHED
Might be a timeout would be helpful?
conn.SetWriteDeadline(…)
There is a dead lock even after I have terminated zrok access.
# netstat -tnp
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 4113781 XXX.XXX.XXX.XXX:57814 XXX.XXX.XXX.XXX:443 ESTABLISHED 3254616/zrok
tcp 5801080 0 XXX.XXX.XXX.XXX:59610 134.158.69.171:443 ESTABLISHED 3254616/zrok
tcp6 5848356 0 XXX.XXX.XXX.XXX:443 XXX.XXX.XXX.XXX:57814 ESTABLISHED 755742/ziti
The corresponding zrok share log is here
goroutine 893 : txer blocks, which in turn blocks any goroutines waiting in WaitForWire
goroutine 397 : The TLS socket’s send buffer is full
I am not sure that the router works properly.
Might be your local network is too fast (Speed 11.2 MB/s)
My router is on 100Mbps network (fast net) but zrok access connects to the router at 1.3MB/s (slow net)
curl -4 --proxy socks5://127.0.0.1:4141 -o /dev/null https://mirror.in2p3.fr/pub/fedora/linux/releases/42/Workstation/x86_64/iso/Fedora-Workstation-Live-42-1.1.x86_64.iso
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 2287M 0 10.5M 0 0 1435k 0 0:27:11 0:00:07 0:27:04 1591k
1 2287M 1 23.5M 0 0 1377k 0 0:28:20 0:00:17 0:28:03 1255k
2 2287M 2 55.8M 0 0 1315k 0 0:29:40 0:00:43 0:28:57 1270k
3 2287M 3 87.6M 0 0 1291k 0 0:30:13 0:01:09 0:29:04 1174k
Can you also grab the stacks from the router, please? Need to see why the write to the router isn't being processed.
You can either do:
ziti fabric inspect stackdump <router id> > /path/to/dump.file
to get it via the controller, or
ziti agent stack -t router > /path/to/dump.file
on the box where the router is running, to get it via IPC. If you do it this way, you'll need to to run the ziti command as the same user as the router is running or as root.
Thank you,
Paul
Can you also run ziti fabric list circuits after you Ctrl-C the file transfer? If you see a circuit left around that shouldn't be there, can you run:
ziti fabric inspect circuit <circuit id>
and post the results up, please?
Thank you,
Paul
So you've got
zrok access -> fast -> router -> slow -> zrok share ?
I'll see if I can test with a similar environment.
Paul
Precisely this: internet <100Mbps> zrok share & router <20Mbps> zrok access
Zrok share is local to the router
I have done the snapshots you have asked: test1 and test2
The before file corresponds to the command ziti fabric inspect stackdump router just before the curl command.
curl -4 --proxy socks5://127.0.0.1:4141 -o /dev/null https://mirror.in2p3.fr/pub/fedora/linux/releases/42/Workstation/x86_64/iso/Fedora-Workstation-Live-42-1.1.x86_64.iso
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
1 2287M 1 24.5M 0 0 1665k 0 0:23:26 0:00:15 0:23:11 1690k^C
The after file corresponds to the command ziti fabric inspect stackdump router just after the curl
#netstat -tnp
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 3781886 XXX.XXX.XXX.XXX:45816 XXX.XXX.XXX.XXX:443 ESTABLISHED 3332526/zrok
tcp 5820632 0 XXX.XXX.XXX.XXX:60440 134.158.69.171:443 ESTABLISHED 3332526/zrok
tcp6 5908037 0 XXX.XXX.XXX.XXX:443 XXX.XXX.XXX.XXX:45816 ESTABLISHED 755742/ziti
Here, the zrok share is sending, the OS has the data, but the router never consumes it — so the buffer fills and everything halts.
I see the deadlock in the router, thank you for providing all the debug information. Should hopefully have a fix for you soon.
Let me know if you'd be willing to test a custom build with a fix.
Thank you,
Paul
I would like to try your fix. Unfortunately I am not a developer. Thus I do not have enough skills to run the build.
if you send me a link to the binary then I can copy it on the host and will let you know about the result.
file /opt/openziti/bin/ziti
/opt/openziti/bin/ziti: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=923eda69fc72dfbca4faf9899307eb20b36c49ad, for GNU/Linux 3.2.0, with debug_info, not stripped
I've uploaded the linux/amd64 build to Google Drive here: https://drive.google.com/file/d/1AKIVTZCOZ9Ln3qOoLHk2daqDMZfX097L/view?usp=drive_link
Let me know if that works for you.
I had also linked the CI build, but @TheLumberjack pointed out that it's not accessible to non-project members.
If you get a chance to test, that would be very appreciated.
Thank you,
Paul
I have deployed ziti 1.6.10 on all my routers. The controller runs ziti 1.6.7.
I have done the same test. This version is much better. Here is the log of my third test.
However the router is still blocked for a few of seconds (8-11s)
curl -4 --proxy http://127.0.0.1:8181 -o /dev/null https://mirror.in2p3.fr/pub/fedora/linux/releases/42/Workstation/x86_64/iso/Fedora-Workstation-Live-42-1.1.x86_64.iso
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 2287M 0 5520k 0 0 1113k 0 0:35:03 0:00:04 0:34:59 1113k^C
$ curl -4 -w '%{time_total}s' -s -o /dev/null -I --proxy http://127.0.0.1:8181 https://www.google.com:443
8.150145s
That means the goroutine (which drains data from the TCP socket) is stuck inside HandleReceive at line 70 of inspect.go. In router-after_test3.txt, the blocked routine
router/handler_ctrl.(*inspectHandler).HandleReceive appears at line 10.
This is the same root cause as in my earlier trace.
I would like to thank you for this fix. After the halt (10s) the router gets back to its normal state.
I don’t know is it possible to eliminate the halt entirely?
Hello,
Thank you for testing! I'll be putting out the 1.6.10 release with fix, ideally in the next day or two.
I was able to reproduce the delay you're seeing. I had to use tc to add latency and packet drops.
I'm going to dig into it and see if I can figure out what's going on. In the meantime, would you be willing to trying setting the superNetwork option, as described in the 1.0.8 release notes? Release v1.0.8 · openziti/zrok · GitHub
If I set that option, I don't see the delay eliminated, but I do see a reduction.
I'll let you know what I find.
Thank you,
Paul
Hmm, I made a mistake in testing and it was actually a longer delay with the superNetwork option enabled. Without it, the delay was generally a few seconds. I did find some changes that brought the delay down by maybe 50%, but didn't eliminate it entirely. I'll have to dig in a bit more to see if I can make an additional improvements.
Paul
Enabling superNetwork allows me to run multiple commands in parallel, which I understand is the intended purpose.
However, after starting the first download in one terminal window, the second download takes about 30 seconds before it begins receiving data. This issue has already been reported earlier.
Once this initial delay has passed, both downloads do proceed in parallel.
Because of this delay, setting superNetwork is unfortunately not very helpful in practice.
$curl -4 --proxy socks5://127.0.0.1:4141 -o /dev/null https://mirror.in2p3.fr/pub/fedora/linux/releases/42/Workstation/x86_64/iso/Fedora-Workstation-Live-42-1.1.x86_64.iso
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 2287M 0 17.1M 0 0 1289k 0 0:30:16 0:00:13 0:30:03 1410k
1 2287M 1 33.8M 0 0 1350k 0 0:28:55 0:00:25 0:28:30 1449k
2 2287M 2 60.2M 0 0 1267k 0 0:30:47 0:00:48 0:29:59 829k
3 2287M 3 69.9M 0 0 1161k 0 0:33:36 0:01:01 0:32:35 737k
3 2287M 3 82.8M 0 0 1079k 0 0:36:10 0:01:18 0:34:52 837k
5 2287M 5 126M 0 0 972k 0 0:40:07 0:02:13 0:37:54 888k
$ curl -4 --proxy socks5://127.0.0.1:4141 -o /dev/null https://mirror.in2p3.fr/pub/fedora/linux/releases/42/Workstation/x86_64/iso/Fedora-Workstation-Live-42-1.1.x86_64.iso
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:05 --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- 0:00:21 --:--:-- 0
0 2287M 0 10.0M 0 0 237k 0 2:44:03 0:00:43 2:43:20 832k
0 2287M 0 17.5M 0 0 336k 0 1:55:51 0:00:53 1:54:58 796k
1 2287M 1 27.4M 0 0 423k 0 1:32:16 0:01:06 1:31:10 777k
1 2287M 1 36.2M 0 0 479k 0 1:21:24 0:01:17 1:20:07 767k
2 2287M 2 57.2M 0 0 561k 0 1:09:29 0:01:44 1:07:45 796k
3 2287M 3 70.2M 0 0 597k 0 1:05:21 0:02:00 1:03:21 800k
zrok v1.1.2 + ziti-router v1.6.10 + ziti-controller v1.6.7