Failure detection and Network Partition management in NuoDB
"the secret of war lies in the communications"
- Napoleon Bonaparte
For NuoDB, it is important that Transaction Engines (TEs) and Storage Managers (SMs) communicate effectively with each other over a network. These interconnected nodes form the back-end fabric that a database instance rests on. Although modern day networks are fairly reliable, there is always the risk that communication between database nodes is affected by either a break in the network link or a partition of the network space. These type of network problems must be detected and resolved to prevent delays in database transactions and to ensure database consistency. To handle such network problems, NuoDB has a failure detection system. The rest of this article describes the basics of this system and how to enable it.
TEs are responsible for handling database transactions, while their SM counterparts handle the storage of data. A running database contains a minimum of one SM and one TE. The number of TEs and SMs can grow as the database needs to scale out to meet user demand. Each TE and SM are fully interconnected.
Figure 1 shows a graph representation of how TEs and SMs connect with each other to form a database instance. Each node in the graph represents either an TE or SM running on a server. The edges in the figure represent a logical network connection between peers. For simplicity, this graph does not include any information about a node's physical location. However, it's possible that nodes could be running in various data centers across the world.
Here is an example of two communication problems NuoDB's failure detection system can detect and resolve:
- Unresponsive nodes in the database. Unresponsive nodes in the database can stall database operations. This is because information that needs to be consistent across the database will be blocked at any unresponsive node.
- Network partitions. Any time groups of nodes in the same database are separated from each other, there is the risk of one group's state diverging from the other group(s). This is because each group might function as an independent (and fully functional) database. For example, if we have a database that has three TEs (TE1, TE2, TE3) and two SMs (SM1 and SM2). It's possible for the database to partition in to two groups due to a network failure:
Group A: (TE1, TE2, and SM1)
Group B: (TE3 and SM2)
From the point of view from each group and the clients connected to each group, the database is fully functional, though there are fewer peers to talk to. Sooner or later, the states of each group may diverge, creating two different views of the world.
NuoDB's approach to dealing with these issues is to disable portions of the database so that only one fully connected portion of the database can continue. In failure case #1, if we remove any TE/SM that has become unresponsive to its peers, the database can continue without delay. The same is also true for failure case #2. If we remove one of these groups from the database, we can prevent two versions of the database from forming. Obviously, for #2 it would be optimal to keep Group A and remove Group B.
The decision to remove unresponsive nodes is made by a quorum of their peers. A quorum in this case is a majority of the original set of nodes that can still communicate to each other. Once a quorum is formed, a decision is made who can stay and who must go. Any node that cannot form a majority quorum will shut itself down. Forming the quorum is done though a leader election process. Leaders are elected by receiving a majority vote from its peers. An election is triggered any time a node detects an unresponsive peer. The leader plus all of its fully connected peers, forms the quorum. Any node that cannot join a quorum elects to shut itself down.
What happens if a majority quorum cannot be established? This can happen when the database is evenly divided. All groups will remove themselves from the database in this situation.
What happens if a leader node becomes unresponsive? Any node can be elected leader. If a leader becomes unresponsive, a new election will be triggered and a new leader will be elected.
Taking another look at the failure scenarios described above, we can see how NuoDB's quorum approach resolves these issues. If there is a break in communication between two nodes (as described by failure #1), an election is triggered by TE1 and TE2. The leader will make a choice to remove either TE1 or TE2. In this example, the functionality of TE1 and TE2 is assumed to be equal, so removing either one is acceptable. For the network partition scenario (i.e., failure #2), each group will trigger separate elections. However, only Group A will be able to form a majority quorum (because it has a majority of the nodes) and its leader will decide to evict the members of Group B. Group B, on the other hand, will fail to obtain a majority and will elect to shutdown.
How do I enable this functionality? To enable this one needs to pass the following argument:
to all participating TEs and SMs. The argument value is number of seconds. This argument specifies the number of seconds a node will wait for its peer to be unresponsive before it triggers an election. Note that the timeout is one of several ways of detecting node failure. Network errors between nodes that cause socket failures are detected immediately and will proactively trigger an election. The timeout mechanism is there to handle the more subtle and nefarious cases where network traffic is being silently dropped or misrouted. The number of seconds chosen has to balance the tolerance of the application for performance dips due to network errors with the desire to avoid shutting down nodes due to transient outages. In our experience, a timeout of 30 seconds is a reasonable starting point as it allows the system to ride out many momentary network hiccups (message buffering can handle several seconds of outage) but won't block clients intolerably long if there really is a serious network failure (like a cabling issue or switch firmware bugs). No one value is perfect for all deployments, so we encourage users to investigate different values under test for their particular application.
Detecting communication failures between TEs and SMs is important for any NuoDB database deployed over a network. The communication between TE and SM peers allows for a single, consistent and responsive database that can be deployed across any network. It is assumed that communication failures between connected peers is inevitable. NuoDB's failure detection is the first line of defense in coping with such network calamity.
This article should give readers a basic understanding of how NuoDB handles two types of network failures 1) unresponsive TEs and SMs, 2) network partitioning. While this article does not cover all the complexities surrounding network failures and database consistency, it should be enough for the reader to begin to get acquainted with NuoDB's underlying mechanism. NuoDB is committed to providing a robust distributed platform in the face of network failures, therefore one should expect this system to be refined in future releases to come.
Some background to NuoDB's approach to this problem can be found at:
- T. Deepak Chandra and S. Toueg, Unreliable Failure Detectors for Reliable Distributed Systems. JACM, pages 225-267, 1999.
- Vijay K. Garg, Elements of Distributed Computing, pages 341-343. John Wiley & Sons Inc., New York, NY. 2002.