The absolute statement is false.

The Consistency Problem of Distributed Lock

(NOTE: this article includes content translated by a machine)

The devil is in the detail.

Use Cases of Distributed Lock

Distributed locks are used to avoid the following issues when shared resources are read and written simultaneously in multi-process/multi-machine scenarios:

  1. Concurrent reading and writing leads to dirty data.
  2. Concurrent reading, based on the read data/state, results in duplicate operations, leading to competition and resource wastage.

The primary purpose of using distributed locks is to solve data consistency and coordination-related issues.

Characteristics of Distributed Lock

The most significant feature of locks is mutual exclusion. The implementation of distributed locks is generally reentrant, meaning that the lock owner can acquire the lock an infinite number of times without releasing it, without causing a deadlock. These two points are usually implemented using atomic CAS (Compare-and-Swap) operations, just like typical lock implementations.

To avoid deadlocks caused by process/node failures, general distributed locks require the lock owner to maintain a heartbeat with the distributed lock service. If the owner loses activity, the service will implicitly release the associated lock resources.

A common misconception is that if you use a TCP connection, which is stateful, you don’t need a heartbeat. In reality, the state of a TCP connection only exists in the memory of the two endpoints and there’s no “black magic”:

  • In cases where the process ends normally, a closing signal is generally sent to the other end. In cases of abnormal termination, the operating system may send a FIN packet for you, but this may vary between different operating systems.
  • If the entire operating system crashes or the machine loses power, no data packets will be sent, and the connection’s abnormal state cannot be detected by the other end.
  • For issues with the data link, such as weak network conditions, router issues, or cable problems, the connection might break without sending or receiving data, making it undetectable by the other end.

This is also why TCP requires a three-way handshake to exchange some necessary parameters, uses a random Initial Sequence Number (ISN), and has a TIME_WAIT state. TCP also has a KeepAlive mechanism, although the cycle is usually quite long, often two idle hours.

Another point is that using the connection level for liveliness detection is not very friendly when the entire network is jittery. If the connection can be immediately re-established after a disconnection, it can be considered that the connection was never lost. This can avoid the overhead of a potential lock switch and subsequent state initialization. So liveliness detection should be a higher-level abstraction.

For users, the characteristics of distributed locks are:

  • Mutual exclusion
  • Reentrant
  • Passive release when inactive (timeout)

Problems

Re-entrance may cause problems when multiple threads locally acquire the lock through the same conditions, resulting in race conditions without self-awareness. This kind of program BUG should also exist :D

The biggest problem is that the distributed lock requires the lock holder to keep heartbeat with the lock resource provider to avoid deadlock, otherwise the lock provider will actively (implicitly) release this lock:

Assume that within the time range indicated by the yellow line, ClientA did not actively release the lock nor crash. It might be due to some strange reasons that the program STW/froze, or the network jitter caused packet loss. Then LockService released the lock resource, and ClientB got the lock resource at this time. For ClientA, it will delay the perception that the lock has changed hands. That is, during the time range indicated by the red line, it still thinks that it holds the lock resource and silently does some operations. A more common situation is that even if ClientA realizes in time that the lock has been released, some ongoing operations cannot be terminated in time. In short, this uncontrollable time difference can cause problems in many scenarios.

Solutions

Distributed locks can solve problems related to consistency and coordination to the greatest extent, but beyond this, it is necessary to analyze and deal with specific business scenarios.

The solution is to ensure that each operation is idempotent, that is, performing the same operation multiple times will not produce side effects.

Idempotency is reflected in data and state, which falls into storage. Generally, this is where to start. This part mainly needs to understand the concurrency control and consistency semantics of various storage systems. Let’s briefly talk about it below :D

CAS

Most databases provide a basic atomic operation like CAS to ensure data consistency.

Assume there is a distributed task scheduling system. Through the distributed lock, a scheduler is elected to allocate and coordinate task execution. The global scheduler is mainly to avoid competition and have a globally consistent view of resources (usually not a bottleneck). Suppose there is a task that needs to be assigned to a resource-rich node. Empty indicates unallocated:

name assigned_to
task1

Normally, there is only one scheduler at a certain moment:

scheduler state
SchedulerA (locked) Local calculation (assigning resources to task1)

At this moment, if there is network jitter or high machine load that leads to the implicit release of the lock, even if SchedulerA senses the delay, it cannot immediately terminate the local calculation thread. A new scheduler is then generated. In name, there is only one scheduler, but in reality, two processes are executing the same resource allocation logic simultaneously.

scheduler state
SchedulerA (unlocked) Local calculation (cannot interrupt execution)
SchedulerB (locked) Local calculation

Assume SchedulerB completes the calculation first (SchedulerA has a higher load and less computing power), and uses the CAS operation to change assigned_to:

# Note this is not client-side logic, but a server-side atomic operation
lock()
if assigned_to == "":
    assigned_to = "node-abc"
    result = success
else:
    result = failed
unlock()
return result

The allocation status of the task becomes:

name assigned_to
task1 node-abc

A while later, SchedulerA also completes the allocation and uses the above CAS operation to update assigned_to, which will fail. In this scenario, this failure is acceptable. As long as the intermediate calculation process does not have side effects and does not affect the final result consistency, it is fine.

ABA Problem

Similar but not exactly the same as the above scenario, using CAS may produce ABA problem where some operations are repeated, leading to resource waste, dirty data, and inconsistent data.

Continue with the scenario of the distributed task scheduling system above. Suppose SchdulerA has not completed the calculation, and node-abc’s disk or memory stick is damaged and recycled. The administrator notifies the only legal scheduler, SchedulerB, of the offline status. SchedulerB also resets the allocation status to empty, and then re-allocates:

name assigned_to
task1

SchedulerA just completes the calculation, stable algorithm, old data, selected a bad node node-abc, completes the update through the C (comparison) in CAS:

name assigned_to
task1 node-abc

SchdulerB’s second calculation result selects another normal node node-xyz, but the CAS update fails. At this time, task1 may never be executed, or if the code is relatively complete, SchdulerB needs to wait for a certain time to reassign task1.

Possible solutions include designing a finer-grained unidirectional state transition mechanism, or adding versions (see below), timestamps (note that time in distributed systems can also cause problems..) and so on.

In more complex scenarios, this example also needs to check whether the assigned_to node is alive. If it is not alive, it needs to be reassigned. In certain situations, such as the scenario where the scheduler and the assigned node are inactive and then revived, as long as the timing is right, it may cause other problems, or the introduction of task execution status updates and retries. Of course, the impact depends on the scenario and needs. Looking at the problem from a different perspective may also be different.

Multi-version/Concurrency Control

etcd/zookeeper

For systems like etcd/zookeeper, they support MVCC. Each operation on the server side has a strictly increasing version number (global/local), similar to this:

operation key value version
PUT foo bar 1
PUT foo baz 2
DELETE foo 3

Then use this version number to do CAS operation. The above ABA problem can be solved like this:

# Note this is not client-side logic, but a server-side atomic operation
lock()
if version == vx:
    value = nx
    result = success
else:
    result = failed
unlock()
return result

(Note: This 100 is not that 100, please refer to the ABA problem above)

Here you may need to combine Lock and Store into one service, but it still depends on how users use the semantic distributed “lock” provided by etcd/zookeeper.

MySQL

I use MySQL less.. MySQL InnoDB is a multi-version data storage engine. Transactions have ACID characteristics, among which isolation levels are divided into several types (affecting atomicity and consistency at the same time):

  • READ UNCOMMITTED: May result in dirty reads, reading uncommitted data
  • READ COMMITTED: Phantom reads may occur, reading the same record twice in the same transaction may result in different results, and each read is the latest snapshot currently visible
  • REPEATABLE READ: (default), reading the same record in the same transaction returns the same result because it reads the latest snapshot seen for the first time. This may cause write skew/lost update due to multiple transactions using “old” data for comparison judgment
  • SERIALIZABLE: All concurrent operations can be thought of as serialized, similar to adding a global read-write lock. All transactions can be sorted by execution time, with a strict order of precedence and no simultaneous occurrence

Although MySQL supports MVCC, this V is invisible and cannot be used for conditional judgment like the above. It can only use locks and transactions. So how to use InnoDB to solve the problem of dirty data?

(Note: This 100 is not that 100, please refer to the ABA problem above)

Theoretically, you can use SELECT ... FOR UPDATE to add a write lock (with NO_WAIT parameters, different isolation levels may behave differently?), or use SERIALIZABLE isolation level to avoid this problem. After all, this lock may be large, and complex scenarios may need to consider using additional Fencing (state) mechanisms to solve it.

Consistency

Designing Data-Intensive Applications also mentions (not the original text): The role of databases (here broadly referring to various storage systems) is to facilitate data storage and retrieval, and they may offer features such as atomicity, isolation, and durability. Consistency is defined as an attribute of the application and is achieved by the application layer relying on database components. Discussing the validity and consistency of data without considering specific requirements, scenarios, and constraints is meaningless.

Requirements, scenarios, constraints!!!