Are the errors themselves the problem, or is the logging of errors slowing things down? If you set the log level to fatal, does that resolve the issue?
You can use the following to set the log level at runtime.
The errors only happen after a circuit is complete and is being torn down. So they are not slowing down services, unless the logging is being overwhelmed by an excess of messages.
Unfortunately, ziti-router 1.6.7 generates many more errors, by a factor of several hundred. This is why the services are slow compared to 1.5.4.
They both work with ziti-controller 1.6.7.
The only way to get back to normal is to downgrade the ziti-router to 1.5.4.
There is some misunderstanding: hiding errors will not be helpful.
I am trying to understand why router 1.5.4 runs smoothly but 1.6.7 has such difficulty handling the payload without disruption.
Clearly every host can handle these additional syslog messages without any noticeable impact on performance. But these errors lead to retransmission/permanent data loss. This is why the services are slower.
What ziti component is hosting the service (ER/T, ZET, SDK (which sdk))?
What ziti component is on the client side (ER/T, ZET, SDK (which sdk))?
What does the traffic look like? If you want to be specific about what software is going over Ziti, that's helpful, but I need to know protocol, traffic patterns, etc. Is it TCP/UDP? Is it HTTP/SSH, etc? Are you doing request/response/close or is it back and forth? Are you using TCP half-close?
Can you quantify the issue? What throughput/latency are you seeing on 1.5.4 vs 1.6.7? Can you grab metrics and compare retransmission rates between the two? Are you seeing connections being unexpectedly terminated?
If you can provide specific instructions on how to reproduce the errors, that would be the most helpful, but if you can describe the data flows in detail, I may be able to reproduce the issue.
I did weeks of data flow testing before we released 1.6.7, and the test cases I have are working fine, so we need to figure out what's different about your network traffic.
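One rough way to put a number on the difference between the two router versions is to count how often the "unable to forward payload" message appears in each router's log. The sketch below is only an illustration: it assumes each log line begins with an ISO-8601 timestamp (which may not match your log format), and the function names and file paths are hypothetical, not part of any ziti tooling.

```python
import sys
from collections import Counter

# Error text quoted in this thread; adjust if your logs word it differently.
ERROR_TEXT = "unable to forward payload"


def count_errors_per_minute(lines, error_text=ERROR_TEXT):
    """Count lines containing error_text, bucketed by minute.

    Assumption: every line starts with a timestamp such as
    2025-01-01T12:00:03Z, so the first 16 characters identify the minute.
    """
    counts = Counter()
    for line in lines:
        if error_text in line:
            counts[line[:16]] += 1
    return counts


def error_ratio(baseline_lines, candidate_lines, error_text=ERROR_TEXT):
    """Return candidate error count divided by baseline error count.

    Returns None when the baseline has no matching errors.
    """
    baseline = sum(count_errors_per_minute(baseline_lines, error_text).values())
    candidate = sum(count_errors_per_minute(candidate_lines, error_text).values())
    return candidate / baseline if baseline else None


if __name__ == "__main__":
    # Hypothetical usage: python compare_errors.py router-1.5.4.log router-1.6.7.log
    with open(sys.argv[1]) as old_log, open(sys.argv[2]) as new_log:
        print(error_ratio(old_log, new_log))
```

For the comparison to be meaningful, both logs should cover the same length of time under a similar load; the per-minute counts can also show whether the errors arrive in bursts.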
Ok, we're making some progress on understanding the scenario.
So we've got zrok on the front-end and back-end.
Are you running lots of connections over the same front-end and back-end or are you just loading down the one or the other?
You are self-hosting zrok, correct?
Can you either tell me what software you're running over zrok, or describe the traffic patterns? How much traffic is getting sent in each direction, is it request/response or is it uncoupled, how large are payloads, etc.
Do you have the 'superNetwork' setting set to true?
Yes, you are right, I run self-hosted zrok. There is a variety of applications: zrok http proxy, tcp tunneling, socks, vpn.
After your explanation I have done more tests. Sometimes there is a relatively large number of connections to the same service, so the zrok service cannot handle the load.
I don’t use superNetwork.
As I understand it, this is why the router logs a large number of "unable to forward payload" errors.
The router is simply saying that the service gave up and closed the connection.