All absolute statements are false.

DO NOT Touch SO_RCVBUF

(NOTE: this article includes content translated by a machine)

EDIT(2018/06/02): I was too ignorant; all that reading was for nothing.. See https://github.com/golang/go/issues/25701, where I mildly complained about Go's interface (Dialer.Control & ListenConfig.Control).. The rest of this article is kept only for reference; it's just a draft.
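
For context, a minimal sketch of what that interface makes possible (assuming Go 1.11+ on Linux; the address and buffer size are placeholders): Dialer.Control runs between socket() and connect(), so an option like SO_RCVBUF can be set before the handshake. Given everything below, this only shows the mechanism, not a recommendation:

```go
package main

import (
	"log"
	"net"
	"syscall"
)

func main() {
	d := net.Dialer{
		// Control runs after the socket is created but before connect(),
		// so the option is in place before window scaling is negotiated.
		Control: func(network, address string, c syscall.RawConn) error {
			var serr error
			if err := c.Control(func(fd uintptr) {
				serr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_RCVBUF, 4<<20)
			}); err != nil {
				return err
			}
			return serr
		},
	}
	conn, err := d.Dial("tcp", "10.0.0.1:9000") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```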

X Me

In the previous article I did something stupid: I called TCPConn.SetWriteBuffer directly in the code, because the metrics improved significantly when transmitting large data blocks within the same datacenter, and I set RCVBUF as well … What I should have done was change the system configuration directly, but there's no point being wise after the event ..
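
For reference, a reconstructed sketch of the kind of per-connection "tuning" described above (not the original code; the 4 MB size and the address are placeholders):

```go
package main

import (
	"log"
	"net"
)

// tune reconstructs the kind of per-connection tweak described above;
// 4 MB is a placeholder, not the value that was used in production.
func tune(c net.Conn) {
	if tcp, ok := c.(*net.TCPConn); ok {
		// Both calls end up as setsockopt(SO_SNDBUF / SO_RCVBUF); the
		// receive-side call also pins the buffer and switches off the
		// kernel's receive-buffer auto-tuning.
		_ = tcp.SetWriteBuffer(4 << 20)
		_ = tcp.SetReadBuffer(4 << 20)
	}
}

func main() {
	conn, err := net.Dial("tcp", "10.0.0.1:9000") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	tune(conn)
}
```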

Without sufficient testing, I pushed it to the production environment on impulse. After a long while everyone felt something was off: the numbers looked terrible across datacenters, and I couldn't figure it out for a while.. A few days ago I wrote a program to test various scenarios (professional testing O(∩_∩)O~) and found the culprit was the receiving side calling TCPConn.SetReadBuffer; TCPConn.SetWriteBuffer seemed to make no difference at all. The TCP/IP stack already does a lot of optimization, and our "optimization" was just us outsmarting ourselves.. After removing it, latency dropped sharply: a large data block (5M) went from 800ms to about 100ms ..

X Me Hard

After running tcpdump I found that the sender often stops after sending a batch of packets and waits for the peer's ACK before continuing. With the custom buffer, the peer's advertised win only reaches about 356, which after multiplying by 2^wscale is roughly 200K. Without setting anything there is basically no pause, and win climbs to its maximum of 16384, i.e. 8M (16384*(1<<9)), half of sysctl_net_ipv4_tcp_rmem[2]. Why half? See net.ipv4.tcp_adv_win_scale; Google it for the details.. Then perf_event + FlameGraph showed sk_stream_wait_memory appearing very frequently, so I guessed the sender's SNDBUF was too small; but after increasing it and testing again, sk_stream_wait_memory on the flame graph looked about the same as when no buffer was set, yet the numbers still didn't improve..
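
A quick back-of-the-envelope check of those window numbers, assuming the wscale of 9 seen in the capture:

```go
package main

import "fmt"

func main() {
	const wscale = 9 // window scale option advertised in the capture

	// Effective receive window = raw "win" field from tcpdump << wscale.
	fmt.Println(356<<wscale, "bytes")   // ~178 KiB with the custom RCVBUF
	fmt.Println(16384<<wscale, "bytes") // 8 MiB with kernel auto-tuning
}
```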

Testing again, I found that changing the system configuration and calling TCPConn.SetReadBuffer directly behave completely differently. The likely reason is that a user-set buffer disables some of the kernel's optimizations. I tried skimming the Linux (v3.10) kernel code and found that setting RCVBUF from user space adds the SOCK_RCVBUF_LOCK flag; when tcp_clamp_window and tcp_rcv_space_adjust see this flag they skip some operations I don't yet understand. I need to study the TCP/IP stack code carefully when I have time (/ω╲) ..

Since I was already at it, I wondered whether I could squeeze out a bit more. After a connection is established there is a slow-start phase: after sending the first 10*mss of data (Linux 2.6.x? tuned by #define TCP_INIT_CWND 10), the sender waits for the peer's ACK before continuing, and there are several more pauses afterwards before things smooth out. Each pause is about 20ms (the network delay), so I thought about increasing the initial window size. After using ip route to adjust the sender's cwnd and the receiver's rwnd there was no noticeable improvement, so I stopped trying. Along the way I typed one wrong command and cut a production machine off the network entirely..

EDIT(TODO): This came back to me today, so I went over it again. In theory the maximum speed of a single connection is min(cwnd, rwnd) / rtt. The sender's cwnd is capped by the receiver's rwnd, and the rwnd is half of the net.ipv4.tcp_rmem maximum (because of net.ipv4.tcp_adv_win_scale), i.e. 8M, so the maximum rate is 8MiB/0.026s ~= 307MiB/s. The actual rate was only 50MiB/s while the receiver's rwnd was already at its maximum, which means cwnd failed to rise for some reason. Again in theory, if cwnd has to grow to 8M, that is ~5746 segments; in the best case, growing exponentially from 10 to 5746 takes log2(1 + 5746/10) * 26ms ~= 238.4ms. In the test the connection was already warmed up, and this was a 10G NIC over tens-of-gigabits of dedicated line, so none of this explains why cwnd never rose, unless the network was very unstable.. The next step would be to try other congestion control algorithms; unfortunately that datacenter has since been taken offline and the old environment is gone, so I'll come back to this when I have time.
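
The same arithmetic spelled out as a tiny program, assuming rtt = 26ms, mss = 1460 bytes, and cwnd doubling once per RTT during slow start:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		rtt  = 0.026   // seconds, cross-datacenter round trip
		rwnd = 8 << 20 // bytes, receiver window (half of the tcp_rmem max)
		mss  = 1460    // bytes per segment
	)

	// Ceiling for a single connection: min(cwnd, rwnd) / rtt.
	fmt.Printf("max rate ~ %.1f MiB/s\n", float64(rwnd)/rtt/(1<<20)) // ~307 MiB/s

	// Segments needed for cwnd to cover the 8 MiB window.
	segs := float64(rwnd) / mss // ~5746
	// Slow start doubles cwnd every RTT starting from TCP_INIT_CWND = 10,
	// so the ramp-up takes roughly log2(1 + segs/10) round trips.
	fmt.Printf("ramp-up ~ %.1f ms\n", math.Log2(1+segs/10)*rtt*1000) // ~238 ms
}
```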

So Much to Learn

The problem is solved, but the process was brute force. It's still quite uncomfortable to have only data and no theory to back it up; I have to find time to study the TCP/IP protocol stack..

Don’t be too clever

Of course, if some expert wants to whack me over the head and set me straight, I'd be very grateful.

Refs