Snap! Websites
An Open Source CMS System in C++
Chapter 7. Mutual Exclusion — Lamport's solution, also called the Bakery Algorithm (p. 130)
The snaplock project is part of the snapwebsites environment. It is a daemon written in C++ that allows you to lock various resources on an entire cluster of computers for a small amount of time. A resource has a URL, which is what we use to create the lock (we call it the Object Name.) The URL can be global to the whole cluster, specific to a website, or even specific to one special variable in a specific page of a specific website (in other words, very fine granularity is possible.) At this time, only exclusive locks are available. It is planned to add a semaphore capability which will offer a Read/Write (exclusive) or Read-Only (shared) lock mechanism.
The snaplock tool runs on any system that may require a lock; more or less, all the nodes in a Snap Cluster.
The locks are obtained using Lamport's Bakery Algorithm, extended to support a non-complete network (i.e. using a Paxos-like consensus instead of a complete consensus.) However, before we can have valid locks in a cluster, we first have to make sure that the cluster is considered up. This relies on several premises:
When starting the elections, we must have a quorum of nodes (see below for the computation of the quorum parameter,) otherwise snaplock could not guarantee that the locks would actually be locking anything exclusively. In other words, a snaplock daemon on its own could be asked to "LOCK" the same URL as another snaplock and return "LOCKED"... and clearly two clients would start working on the exact same resource, which is not what we want. Instead, by having a quorum, we make sure that if anyone asks for something to be locked, it for sure gets locked exclusively.
Further, snaplock makes sure that the locking is fast by using at most three nodes to perform the actual lock (Lamport's Bakery Algorithm). Many lock systems use just one node, often called a Leader or Master, but this can cause one big problem: it is not failsafe in case of a crash, i.e. on a crash, the information about all the existing locks is lost. It would be possible to then wait for a little while before accepting new locks (i.e. blindly timing out old locks), but offering new locks too quickly could create weird bugs, and some of our locks last a very long time (hours, for some of our backends.) Really, without any information about those existing locks, we can't even cancel them properly.
This is why we want three nodes to handle our lock mechanism.
Note: the number of leaders is limited to 3 (it can be 1 or 2 in a cluster of just 1 or 2 computers.) Having just 3 leaders allows for several simplifications, which is why we chose that number for now. Later we will add a stronger failsafe by having leaders share the current locks with other, non-leader snaplock daemons.
First, snaplock starts and initializes various connections. In particular, it connects to snapcommunicator so it can learn about the network. However, other services cannot start making use of it until the cluster is up and leaders have been elected. Note that in most cases leaders get elected as soon as the cluster of snaplock daemons is considered up.
In the initialization step, the snaplock service generates a random identifier to be used in the election process. This allows varying nodes to end up being leaders instead of always the exact same ones (assuming you have more than three electable nodes, of course.) There is also a priority which gives certain nodes a higher chance of becoming a leader. That priority can also be used to mark certain computers as not interested, meaning that they will never be elected leader.
The snaplock daemons have a status which can be queried using the LOCKSTATUS message. The reply is either NOLOCK or LOCKREADY:

NOLOCK: the snaplock service is not currently offering locks; it is not yet ready. It may be that it still needs confirmation from its peers that they are up and running, or there is a problem and the elections can't be conducted (i.e. you marked too many computers as being Off, that is, not candidates in the election.)

LOCKREADY: the lock system is ready. If you receive this reply to your LOCKSTATUS request, a LOCK will quickly get a reply with a LOCKED message (or an error).
Note that LOCK messages sent to snaplock before the LOCKREADY status are cached by snaplock until they time out. So you do not have to listen for the NOLOCK and LOCKREADY statuses, although many daemons do, so they themselves can be marked as being ready or not (i.e. the snapdbproxy daemon sends you NOCASSANDRA as a reply to its CASSANDRASTATUS message as long as snaplock says NOLOCK.)
snaplock stays in the NOLOCK state until a valid election happens and it enters the LOCKREADY state. If you shut down many nodes all at once, then the state may return to NOLOCK. On a large enough live system, though, once the state is LOCKREADY it should remain that way forever, except on computers where you are about to do some maintenance (i.e. rebooting, upgrading, etc.)
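As an illustration, here is a minimal sketch of how a client service might track those two statuses; the lock_status_tracker class and the way messages reach it are made up for this example, real services receive these messages through the snapcommunicator messaging classes.

#include <string>

// Sketch: track whether the lock system is usable; the snaplock daemon
// sends NOLOCK or LOCKREADY in reply to a LOCKSTATUS message.
class lock_status_tracker
{
public:
    // call this with the command name of each status reply received
    void on_status_reply(std::string const & command)
    {
        if(command == "LOCKREADY")
        {
            f_ready = true;
        }
        else if(command == "NOLOCK")
        {
            f_ready = false;
        }
    }

    // true once a LOCK is expected to quickly return LOCKED
    bool ready() const
    {
        return f_ready;
    }

private:
    bool f_ready = false;   // assume not ready until snaplock says otherwise
};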
I have a great book called Distributed Systems: An Algorithmic Approach by Sukumar Ghosh. It includes a few pages about election algorithms, i.e. how to determine which node should be the leader. However, there are two drawbacks with those algorithms: they expect a complete network (every node can talk to every other node), and they elect a single leader.
So we based our algorithm on the one presented in Sukumar's book, but we changed a couple of things to allow the algorithm to work properly in a non-complete cluster and to elect three leaders instead of just one.
The algorithm I wrote is still fairly simple. There are three important messages: CLUSTERUP, LOCKSTARTED, and LOCKLEADERS.
The CLUSTERUP message is sent by the snapcommunicator daemon once it knows of a little over 50% of the nodes in the cluster. The number of nodes needed to consider the cluster as being up is called the quorum. It is calculated as:
quorum = n / 2 + 1
where n is the total number of computers (the division is an integer division). Once your cluster is properly installed, we know n since we have the complete list of all the neighbors. That number will fluctuate as you add and remove nodes, although once you reach 5 nodes, a fluctuation of +1 or -1 can easily be ignored.
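In code, the computation is a one-liner; the quorum() function below is just a sketch, not the actual snaplock source:

#include <cstddef>

// number of nodes required to consider the cluster up;
// n is the total number of computers known from the list of neighbors
std::size_t quorum(std::size_t n)
{
    return n / 2 + 1;   // integer division: n=4 -> 3, n=5 -> 3, n=10 -> 6
}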
Since snapcommunicator only sends the CLUSTERUP message when the cluster goes from being down to being up, snaplock sends a CLUSTERSTATUS message on startup to make sure it receives the current status at least once.
Note that snaplock ignores the CLUSTERCOMPLETE message that snapcommunicator also sends. Knowing that the cluster is complete is not useful to ensure proper mutual exclusion within the quorum of connected nodes. With 3 or fewer nodes, the cluster up and cluster complete states coincide anyway, because of the minimum number of nodes required to start the elections.
The LOCKSTARTED message is used to connect all the snaplock daemons to each other. When a snaplock daemon starts, it broadcasts that message as soon as it receives the CLUSTERUP message from snapcommunicator. It also sends that message whenever a new computer is added to the cluster.
At first, a node doesn't know about any other node, so it broadcasts the LOCKSTARTED message to all the others. Whenever a node receives a LOCKSTARTED message, it replies with a LOCKSTARTED directly to the sender, but only if it did not yet know that sender. It does not reply if it already knows the sender; that way we avoid infinite loops.
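Here is a minimal sketch of that reply logic; the known_peers set and the send_lockstarted() function are assumptions made for this example, not the actual snaplock interfaces:

#include <iostream>
#include <set>
#include <string>

std::set<std::string> known_peers;      // snaplock daemons we know of

// stand-in for sending a LOCKSTARTED message through snapcommunicator
void send_lockstarted(std::string const & to)
{
    std::cout << "LOCKSTARTED -> " << to << std::endl;
}

void on_lockstarted(std::string const & sender)
{
    // insert() tells us whether the sender was already known; we reply
    // only to new senders, which is what prevents infinite message loops
    if(known_peers.insert(sender).second)
    {
        send_lockstarted(sender);
    }
}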
The current implementation waits for a quorum of snaplock instances to be known before proceeding with the leader elections. The quorum is calculated the same way as for the CLUSTERUP message above. Note that our algorithm would work within a quorum of nodes without the need for all the snaplock daemons to know each other, but that would require a more complicated algorithm with many more messages to ensure the election was run properly.
Once the quorum is reached, the election happens (see below for details about that.)
The LOCKLEADERS message is sent once the election results have been determined.
The snaplock daemon with the smallest IP address found in the quorum of snaplock daemons proceeds with the election and determines the three leaders.
This message is considered authoritative. In other words, it requests that all the previously defined leaders be overwritten with this new list of leaders.
Each snaplock still takes the election date into account (i.e. the most recent election results are the ones kept.) This assumes our network has clocks that are properly synchronized (we use ntp daemons.)
The election happens if and only if all of the following tests are true: the cluster is up (the CLUSTERUP message was received), a quorum of snaplock daemons know of each other, no election results are in force yet (or a leader was lost), and the node has the smallest IP address among the participating snaplock daemons.
Once all of these are true, the node with the smallest IP address at the time the cluster is formed elects the three leaders and broadcasts the results with the LOCKLEADERS message. At this point, further LOCKSTARTED messages include the leaders, so new nodes coming up are automatically told of the results and do not take the initiative of running a new election.
It's at this point that we enter the LOCKREADY state.
The way the election works is by sorting the candidates by ID and keeping the first three. The ID is formed by concatenating the following five elements: the node's priority, the random identifier generated on initialization, the node's IP address, the process ID, and the computer name.
IMPORTANT NOTE: This algorithm does not currently work 100% correctly in a dynamic environment (i.e. a cluster that grows and shrinks.) The algorithm fails if the number of computers is very close to the quorum and new nodes get added at the time of an election, because that changes the quorum and may cause problems. As long as new nodes get added when the network is CLUSTERCOMPLETE or when no elections are taking place, there should be no concern.
Once the voting process succeeds, the cluster knows of the three leaders.
Each leader changes its priority to 0. This way, in the next elections, they will always be the winners; if one of them is lost (a reboot happens or the node crashes,) then the other two will remain leaders and one other node is elected as a replacement for the lost leader.
Note that the priority going to zero is virtual: it's applied at the time the elections happen. The computer taking care of sorting the nodes' IDs makes sure to insert "00" as the priority of existing leaders.
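The following sketch shows the general idea of the sort; the field widths, separators, and the exact order of the last few elements are assumptions made for this example, only the principles described above (lexicographic sort, priority first, "00" for current leaders, keep the first three) come from the text:

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct candidate
{
    int         f_priority;     // election priority (nodes marked "off" are excluded)
    int         f_random;       // random identifier generated on initialization
    std::string f_ip;           // IP address of the node
    int         f_pid;          // process identifier
    std::string f_name;         // computer name
    bool        f_is_leader;    // true for the current leaders
};

std::string election_id(candidate const & c)
{
    // current leaders virtually get priority "00" so they sort first
    // and are automatically re-elected
    int const priority(c.f_is_leader ? 0 : c.f_priority);
    char buf[256];
    std::snprintf(buf, sizeof(buf), "%02d|%08d|%s|%d|%s",
                  priority, c.f_random, c.f_ip.c_str(),
                  c.f_pid, c.f_name.c_str());
    return buf;
}

std::vector<candidate> elect_leaders(std::vector<candidate> candidates)
{
    // sort lexicographically by ID; the smallest IDs win
    std::sort(candidates.begin(), candidates.end(),
        [](candidate const & a, candidate const & b)
        {
            return election_id(a) < election_id(b);
        });
    if(candidates.size() > 3)
    {
        candidates.resize(3);   // at most three leaders
    }
    return candidates;
}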
As soon as one of the three leaders is known to be lost, a new election is run by the snaplock with the lowest IP address. Since the two remaining leaders use a priority of "00", they will remain leaders (they are automatically re-elected.) However, the order may still change, so leader #2 may become leader #1, for example.
Up until the election happens, new nodes simply become potential candidates (unless their priority is "off"; such nodes are still added to the list, but they are ignored in the election process itself.)
After the elections, new nodes are sent the results along with the LOCKSTARTED replies. A node receiving a LOCKSTARTED message with a list of leaders immediately enters the LOCKREADY state.
With three leaders, the snaplock system is ready to accept LOCK requests. These include a URL, the time by which we want to obtain the lock, and the duration of the lock, in case we do obtain it.
If the lock can't be obtained for whatever reason, then a LOCKFAILED message is sent to the lock requester.
If the lock is obtained, then the current date plus the lock duration is used to calculate the exact date when the lock will time out. That information is sent along with the LOCKED message.
The client can send the UNLOCK message once done with the resource, and it is expected to happen that way. However, if the UNLOCK does not arrive in time, the snaplock system sends an UNLOCKED message to the client, which should react by releasing the resource as soon as possible and then send an UNLOCK message.
The snap_lock object in the snapwebsites library wraps all of that functionality in an object one can use to obtain a lock, which makes it very easy to use. For locks protecting work that takes a long time (like with backends), get_timeout_date() can be used to know when the lock is going to time out. The lock_timedout() function returns true once the lock has timed out; that second function actually reacts to the UNLOCKED message and tells you whether UNLOCKED was already received or not. Note that the lock stays in place until you react, assuming you react quickly enough.
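Here is a minimal usage sketch; the exact constructor parameters may differ from the current library headers, so check snap_lock.h rather than copying this verbatim:

#include <snapwebsites/snap_lock.h>

void process_resource()
{
    // the constructor blocks until the lock is obtained;
    // it throws if the lock cannot be obtained (LOCKFAILED)
    snap::snap_lock lock("http://www.example.com/some-page");

    // ... work on the locked resource ...

    // long running work should periodically check for a timeout;
    // this reacts to the UNLOCKED message sent by snaplock
    if(lock.lock_timedout())
    {
        // stop using the resource as soon as possible
        return;
    }

    // ... more work ...

    // the lock is released (UNLOCK) when 'lock' goes out of scope
}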
On the server side, many things happen to determine whether a lock is available or not.
First, if the LOCK message gets sent to a snaplock which is not a leader, that snaplock proxies the LOCK message to one of the leaders. The first time, it selects a leader at random. After that, it uses a round robin mechanism so each leader receives LOCK messages and has a chance to be the master.
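The selection logic amounts to something like the following sketch; the lock_forwarder class is invented for this example, in the real daemon the message is of course forwarded through snapcommunicator:

#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

class lock_forwarder
{
public:
    lock_forwarder(std::vector<std::string> leaders)
        : f_leaders(std::move(leaders))
    {
        assert(!f_leaders.empty());
        f_next = std::rand() % f_leaders.size();    // first leader is chosen randomly
    }

    // each subsequent call moves to the next leader (round robin) so
    // every leader gets a chance to be the master of a LOCK request
    std::string const & next_leader()
    {
        std::string const & leader(f_leaders[f_next]);
        f_next = (f_next + 1) % f_leaders.size();
        return leader;
    }

private:
    std::vector<std::string> f_leaders;     // addresses of the three leaders
    std::size_t              f_next = 0;
};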
When a leader receives a LOCK message, it becomes the master for that one LOCK request. If it dies before the lock is obtained, then another leader will take over and restart the process.
From there on, the lock process goes like this (which is an advanced implementation of Lamport's Bakery Algorithm):
A and B are the other two leaders.
The master sends its message to A and B.
A and B reply to the master.
In all cases, when the master receives at least one reply confirming that a message was received, it has a quorum and can consider that there is consensus. With that, it can move forward to the next step. When the second reply arrives, it is simply ignored.
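In code form, the rule is tiny; this sketch only counts the replies for a single step of a single lock request, whereas the real implementation obviously tracks this per message:

class lock_step
{
public:
    // called once for each reply received from A or B
    void on_reply()
    {
        ++f_replies;
        if(f_replies == 1)
        {
            // the master plus one reply is 2 nodes out of 3: that is a
            // quorum, so there is consensus and we move to the next step
            proceed_to_next_step();
        }
        // a second reply changes nothing and is simply ignored
    }

private:
    void proceed_to_next_step()
    {
        // continue with the next message of the bakery algorithm sequence
    }

    int f_replies = 0;
};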
At any stage, many tests are done to make sure that the algorithm is working as expected. If any error is detected, a LOCKFAILED happens with the "error" parameter likely set to "invalid".
If the messages somehow take too long, then the process stops with a LOCKFAILED message and the "error" parameter set to "timedout".
After you received the LOCKED message, you should not receive a LOCKFAILED. Instead, you will get an UNLOCKED message. The UNLOCKED message may include an "error" parameter set to "timedout", in which case the lock remains in place for at least as long as your lock duration or 1 minute, whichever is the longest. That additional time must be used to stop using the resource that was locked, or bugs may ensue as the previously locked data will be shared with others; if the resource needs to be edited by at most one process at a time, you're in big trouble!
Note that if the client closes its connection to the snaplock daemon (i.e. the snaplock daemon receives an error when attempting to send a message to the client,) then it is assumed that the client already decided to stop using the lock.
The source code is available on github. Releases are likely to have a more stable version. Note that snaplock requires the libsnapwebsites library and snapcommunicator at a minimum. These have additional dependencies of their own (some of which are in our contrib folder.)
We also have binary builds on Launchpad. It is part of the snapwebsites project.
Under Ubuntu, add the repository this way:
sudo add-apt-repository ppa:snapcpp/ppa
sudo apt-get update
Then install the snaplock packages with the install command:
sudo apt-get install snaplock
sudo apt-get install snaplock-doc
Please contact us on github by opening an issue.
The tool requires the snapwebsites environment and various contribs.
Long term, we are thinking of creating a snapcommunicator contrib (i.e. our network communication layer,) which would allow us to offer snaplock as a contrib as well. This would very much reduce the size of the snapwebsites project and especially allow other projects to easily reuse snaplock in their own environments.
This tool is licensed under the GPL v2+.