All absolute statements are false.

Pitfalls of etcd

(NOTE: this article includes content translated by a machine)
#etcd   #Raft

The previous article was getting bulky, so I decided to split this part out into a new one. The so-called “pitfalls” mostly come down to incomplete etcd documentation or overly high expectations on the users’ side.

When using etcd, you should understand the differences between Revision, ModRevision and Version. If consistency matters, you should also understand linearizable reads versus serializable reads. etcd defaults to the strongest option, linearizable reads, which inevitably costs some performance, because every such request has to go directly or indirectly through the Leader and complete the Raft protocol.
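
For example (a minimal sketch; the endpoint, key name and the go.etcd.io/etcd v3.4 import path are assumptions), the Go client exposes the weaker mode via the clientv3.WithSerializable() option, while a plain Get is linearizable by default:

package main

import (
    "context"
    "fmt"
    "time"

    "go.etcd.io/etcd/clientv3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"127.0.0.1:2379"}, // assumed local endpoint
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()

    // Default: linearizable read, which goes through the Leader's Raft read path.
    strong, err := cli.Get(ctx, "/demo/key")
    if err != nil {
        panic(err)
    }

    // Serializable read: answered locally by whichever member serves the request,
    // cheaper but possibly stale.
    weak, err := cli.Get(ctx, "/demo/key", clientv3.WithSerializable())
    if err != nil {
        panic(err)
    }

    fmt.Println(strong.Header.Revision, weak.Header.Revision)
}

Serializable reads are fine where slightly stale data is acceptable; anything used for coordination should stay with the linearizable default.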

Pitfall One

Lease.KeepAlive is cancelled if it happens to time out exactly while there is no Leader (i.e. during an election), so our node “disappears”. You have to write your own retry logic (see the sketch below). Other API-level pitfalls are covered in Pitfall Four; keepalive-related APIs also deserve attention, for example concurrency, where you need to keep monitoring channels such as Done().
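
A rough sketch of such retry logic, assuming the go.etcd.io/etcd clientv3 API and a made-up key and TTL: re-grant the lease, re-put the key, and restart KeepAlive whenever the keepalive channel closes.

package example

import (
    "context"
    "time"

    "go.etcd.io/etcd/clientv3"
)

// keepRegistered re-registers this node whenever the keepalive dies, e.g. because
// KeepAlive was cancelled during a leader election.
func keepRegistered(ctx context.Context, cli *clientv3.Client, key, value string) {
    for ctx.Err() == nil {
        lease, err := cli.Grant(ctx, 10) // assumed 10s TTL
        if err != nil {
            time.Sleep(time.Second)
            continue
        }
        if _, err := cli.Put(ctx, key, value, clientv3.WithLease(lease.ID)); err != nil {
            time.Sleep(time.Second)
            continue
        }
        kaCh, err := cli.KeepAlive(ctx, lease.ID)
        if err != nil {
            continue
        }
        // Drain keepalive responses; when the channel closes (keepalive cancelled,
        // lease expired, connection lost, ...), loop and register from scratch.
        for range kaCh {
        }
    }
}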

Thinking a bit further: how does keepalive deal with clocks? Within a single term a Lease timeout is not a problem, because one machine’s monotonic clock is easy to rely on. But if a re-election happens inside the Lease’s TTL window, how should the new Leader expire the Lease, given that there is no clock synchronization and clocks may be skewed? Does the new Leader simply submit a delete for the entries? Then, until the client’s next renew, the entries attached to that Lease would disappear for a while… In fact, etcd’s implementation simply renews the Lease after a new election. The potential problem is that if elections are frequent and the Lease TTL is long, the Lease may never expire. See the issue “Lease TTL resetting on leader elections or single-node restart”; this behavior is not documented, and the official fix can be found in this PR

Pitfall Two

etcd’s MVCC database stores every version of the full data. If updates are frequent and the values are relatively large, you will soon hit mvcc: database space exceeded. At that point all write operations fail; only reads and deletes are allowed. To recover, follow the documentation: compact -> defrag -> disarm. But defrag is a heavy, blocking operation, during which that node (or the cluster) is basically unusable. Raising the space quota is a workaround, but it only buys time, and an oversized boltdb hurts performance, especially when MAP_POPULATE is set.

It is best to configure auto compaction as early as possible and to pick a compaction mode that fits the workload. If the write volume is small, auto compaction is enough; if it is large, cache at the client level and merge or compress the data set before writing to reduce the volume. Note that the file on disk does not shrink after compaction: boltdb does not return freed space to the operating system but keeps its own freelist, and new data reuses that space. If the on-disk file exceeds the quota, you must defragment. Be sure to read the documentation thoroughly
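
For reference, a rough sketch of the compact -> defrag -> disarm recovery flow via the Go client’s KV and Maintenance APIs (the etcdctl equivalents are compact, defrag and alarm disarm); the endpoints, the probe key and the error handling are simplified assumptions.

package example

import (
    "context"

    "go.etcd.io/etcd/clientv3"
)

// recoverFromSpaceExceeded walks through compact -> defrag -> disarm once the
// NOSPACE alarm has fired.
func recoverFromSpaceExceeded(ctx context.Context, cli *clientv3.Client, endpoints []string) error {
    // 1. Compact up to the current revision so old MVCC versions can be dropped.
    resp, err := cli.Get(ctx, "any-key") // we only need the current header revision
    if err != nil {
        return err
    }
    if _, err := cli.Compact(ctx, resp.Header.Revision); err != nil {
        return err
    }

    // 2. Defragment each member to actually shrink the boltdb file on disk.
    //    This is blocking: the member being defragmented serves little while it runs.
    for _, ep := range endpoints {
        if _, err := cli.Defragment(ctx, ep); err != nil {
            return err
        }
    }

    // 3. Disarm the NOSPACE alarm so writes are accepted again.
    if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{}); err != nil {
        return err
    }
    return nil
}

To prevent the problem rather than cure it, the server-side --auto-compaction-mode / --auto-compaction-retention flags and, where justified, --quota-backend-bytes are the knobs to look at.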

This is not really a pitfall, but most people don’t learn about it until they run into the problem. The real pitfall, in my opinion, is that keepalive keeps working even after mvcc: database space exceeded. Digging through the etcd source, only lease grant/revoke goes through Raft; renew is handled directly on the Leader. Only the LeaseID and TTL are persisted; everything else is kept in memory, which echoes the further thinking in Pitfall One.

Also, the etcd source code has channels flying all over the place, and many interfaces exist mainly to decouple the code. Golang’s implicit interfaces are really painful to read. I’m just ranting; who knows what I would do if I were in charge.

Pitfall Three

Network jitter, or a minority of etcd nodes being isolated for longer than the election timeout and then rejoining the cluster, triggers a re-election.

I was shocked when I first hit this problem; in multi-data-center deployments the probability of it happening is relatively high, which hurts robustness. However, etcd v3.4 will add --experimental-pre-vote to solve this, and v3.5 should enable it by default.

Let’s analyze in detail to deepen our understanding.

Without PreVote, the behavior seen in the logs of the production etcd nodes is as follows (the Raft-level details should be self-explanatory):

  1. Network status ok

       node   current term   log                    state
       A      5              [(term:5, index:1)]    Leader
       B      5              [(term:5, index:1)]    Follower
       C      5              [(term:5, index:1)]    Follower
  2. Network status bad, node C is isolated for longer than the election timeout

       node   current term             log                                        state
       A      5                        [(term:5, index:1), (term:5, index:2)]     Leader
       B      5                        [(term:5, index:1), (term:5, index:2)]     Follower
       C      > 5 (keeps increasing)   [(term:5, index:1)]                        Follower <-> Candidate
  3. Network recovers, node C rejoins the cluster

    At this point C’s Term is definitely greater than 5, possibly by a lot. Suppose C receives AppendEntries from Leader A: since C’s Term is larger, C will not acknowledge A as Leader and keeps trying to start elections. But C’s log is not up-to-date, so A and B reject C in the RequestVote stage and C’s elections keep failing. Now the problem appears: C refuses to acknowledge A as Leader and keeps incrementing its Term by campaigning over and over, so it might be unable to rejoin the cluster for a long time, which looks even worse. According to the Raft paper, when Leader A sees a higher Term in an AppendEntries response, it updates its own Term and steps down to Follower; a re-election follows, and A is most likely to become Leader again.

Of course, if C’s log is up-to-date, C’s election will be successful directly in the RequestVote stage.

Wait a minute… the actual behavior in the logs is not what step 3 describes: C’s RequestVote is rejected for a different reason, as follows:

  • B receives C’s RequestVote (election) request, but B has received AppendEntries (possibly empty heartbeats) from Leader A within the last election timeout, so B is confident that Leader A is still working normally and rejects C’s request outright
  • Leader A keeps receiving AppendEntries responses from Follower B within the election timeout, so a majority still upholds A’s authority; A can confirm that it is still the Leader and also rejects C’s election request outright
  • What actually triggers the re-election is, of course, Leader A seeing a higher Term in an AppendEntries response

So this is actually a bit of a trade-off… Logically, Leader A should defend its authority on the AppendEntries path as well, but in a cluster of only three nodes, if C were not allowed to catch up on Term through the repeated elections described above and rejoin the cluster, then the moment another node failed the cluster would become unavailable.

What exactly is the use of the aforementioned PreVote? Refer to: CONSENSUS: BRIDGING THEORY AND PRACTICE (4.2.3 Disruptive servers)

Simply put, a candidate cannot enter the RequestVote stage unless its PreVote is approved by a majority of nodes, and the PreVote stage does not increment the Term. So when C rejoins the cluster, its repeated PreVote attempts are rejected for either of the two reasons RequestVote was rejected above; C soon receives AppendEntries from Leader A and simply stays in the Follower state without causing any disruption.
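
For reference, PreVote is just a configuration field in etcd’s raft library (alongside the related CheckQuorum behavior that explains the rejections above); a minimal, assumption-laden single-node sketch:

package main

import (
    "go.etcd.io/etcd/raft"
)

func main() {
    // Single-node example only to show the knobs; etcd itself wires this up internally.
    storage := raft.NewMemoryStorage()
    cfg := &raft.Config{
        ID:              0x01,
        ElectionTick:    10,
        HeartbeatTick:   1,
        Storage:         storage,
        MaxSizePerMsg:   4096,
        MaxInflightMsgs: 256,
        PreVote:         true, // candidates must win a non-disruptive pre-election first
        CheckQuorum:     true, // leader/follower authority checks like the ones described above
    }
    node := raft.StartNode(cfg, []raft.Peer{{ID: 0x01}})
    defer node.Stop()
}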

Pitfall Four

When using the Go Watch API, if the server compacts past the revision the watch is at, the watch fails with an ErrCompacted error and the WatchCh is closed. With the following pattern, the code then spins in a dead loop (Google: closed channel, nil channel, etc.):

for {
    select {
    case <-doneC:
        return
    case resp := <-watchC:
        doSthWith(resp)
    }
}

Of course, other unrecoverable errors also stop the client from retrying and cause the same problem; see the documentation (⊙﹏⊙)b. The fix is to range over the channel, or check whether the channel is closed, or check the status of each WatchResponse, and then retry as needed.

The most important thing is to read the documentation carefully. If you no longer need the watch, be sure to cancel the context, otherwise you will leak goroutines. A more robust version of the loop above might look like the sketch below.
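
A hedged rewrite of the loop above (the key name, retry policy and resume-from-revision logic are illustrative assumptions): range over the channel, inspect each WatchResponse, and re-establish the watch when the channel closes.

package example

import (
    "context"

    "go.etcd.io/etcd/clientv3"
)

// doSthWith stands in for whatever the application does with a response,
// mirroring the placeholder in the loop above.
func doSthWith(resp clientv3.WatchResponse) {}

// watchForever keeps a watch alive across compactions and cancellations until ctx is done.
func watchForever(ctx context.Context, cli *clientv3.Client, key string) {
    nextRev := int64(0)
    for ctx.Err() == nil {
        opts := []clientv3.OpOption{}
        if nextRev > 0 {
            opts = append(opts, clientv3.WithRev(nextRev))
        }
        for resp := range cli.Watch(ctx, key, opts...) {
            if resp.CompactRevision != 0 {
                // Our start revision was compacted away; resume from the compact revision.
                nextRev = resp.CompactRevision
                break
            }
            if err := resp.Err(); err != nil {
                break // canceled or otherwise broken; the channel is about to close
            }
            doSthWith(resp)
            nextRev = resp.Header.Revision + 1
        }
        // Channel closed (compaction, cancellation, client shutdown, ...):
        // loop and re-establish the watch unless ctx itself is done.
    }
}

Ranging over the channel means a closed WatchChan simply exits the inner loop instead of spinning, and canceling ctx ends both loops, which also releases the watcher goroutines.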

Pitfall Five

Most of the APIs take an explicit context.Context parameter and have no Close method. Note that a few, such as concurrency.NewSession, have a Close method and an implicit ctx option. Be careful here: it is best to pass concurrency.WithContext(ctx) explicitly so that a single ctx is used consistently and Close is not forgotten. We once used this API for leader election and it caused a ghost lock in production, which was quite a pitfall.
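
A sketch of what that looks like (the prefix, TTL and values are made up): tie the Session to ctx via concurrency.WithContext, and still call Close explicitly so the lease is revoked rather than lingering until its TTL runs out.

package example

import (
    "context"
    "log"

    "go.etcd.io/etcd/clientv3"
    "go.etcd.io/etcd/clientv3/concurrency"
)

func campaign(ctx context.Context, cli *clientv3.Client) error {
    sess, err := concurrency.NewSession(cli,
        concurrency.WithContext(ctx), // the session's keepalive stops when ctx is canceled
        concurrency.WithTTL(15),      // assumed TTL in seconds
    )
    if err != nil {
        return err
    }
    defer sess.Close() // revokes the lease; without this a "ghost lock" lingers until the TTL expires

    e := concurrency.NewElection(sess, "/demo/election") // assumed prefix
    if err := e.Campaign(ctx, "node-1"); err != nil {
        return err
    }
    log.Println("elected as leader")

    <-sess.Done() // session lost (lease expired or ctx canceled): stop acting as leader
    return nil
}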

Pitfall Six

The false-positive problem: a simple PUT, for example, may succeed on the server while the client receives a failure because of a timeout or similar. When retrying or issuing the next operation, you may need a CAS check on the Version or Value to avoid overwriting the previous value and losing a write.
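
A sketch of such a CAS, using a clientv3 transaction that compares the current Value before writing (the key names, values and decision logic are illustrative):

package example

import (
    "context"

    "go.etcd.io/etcd/clientv3"
)

// putIfUnchanged only writes newVal if the key still holds oldVal, so a PUT that
// actually succeeded before the timeout is not silently overwritten on retry.
func putIfUnchanged(ctx context.Context, cli *clientv3.Client, key, oldVal, newVal string) (bool, error) {
    txn := cli.Txn(ctx).
        If(clientv3.Compare(clientv3.Value(key), "=", oldVal)).
        Then(clientv3.OpPut(key, newVal)).
        // If the compare fails, read the current value so the caller can inspect it.
        Else(clientv3.OpGet(key))
    resp, err := txn.Commit()
    if err != nil {
        return false, err
    }
    return resp.Succeeded, nil
}

If Succeeded is false, the Else branch returns the current key-value; finding newVal there usually means the earlier “failed” PUT actually went through, so no further retry is needed.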

Pitfall N

TBD..