snaplock

Chapter 7. Mutal Exclusion — Lamport's solution, also called the Bakery Algorithm (p. 130)

Introduction

The snaplock project is part of the snapwebsites environment. It is a daemon written in C++ allowing you to lock various resources on an entire cluster of computers for a small amount of time. A resource has a URL which is what we use to create the lock (we call it the Object Name.) The URL can be global to the whole cluster, specific to a website, or even specific to one special variable in a specific page of a specific website (in other words, very small granularity is possible.) At this time only exclusive locks are available. It is planned to add a semaphore capability which will offer a Read/Write (exclusive) or Read-Only (shared) lock mechanism.

The snaplock tool runs on any system that may require a lock. More or less, all the nodes in a Snap Cluster.

The locks are obtained using the Lamport's Bakery Algorithm ameliorated to support a non-complete network (i.e. using a praxos consensus instead of a complete consensus.) However, before we can have valid locks in a cluster, we first have to make sure that the cluster is considered up. This includes several premises:

snapcommunicator sent us a CLUSTERUP message with the total number of nodes in the cluster
The number of known snaplock represents a quorum
The known nodes is at least 3 (it can be less if the total number of nodes is only 1 or 2—in this case the cluster must be CLUSTERCOMPLETE)
There are enough computers participating in the elections so enough can become leaders
The election ran succesfully
Each snaplock daemon is connected to the leaders

When starting the elections, we must have a quorum of nodes (see below for computation of the quorum parameter,) otherwise snaplock could not guarantee that the locks would actually be locking anything exclusively. In other words, a snaplock daemon on its own could be asked to "LOCK" the same URL as another snaplock and return "LOCKED"... and clearly all the nodes would start working on the exact same resource, which is not what we want. Instead, by having a quorum, we make sure that if anyone asks for something to be locked, it for sure gets locked exclusively.

Further snaplock makes sure that the locking is fast by using at most three nodes to perform the actual lock (the Lamport's Bakery Algorithm). Many lock systems use just one node, often called a Leader or Master, but this can cause one big problem: it is not failsafe in case of a crash. i.e. on a crash, the information about all the existing locks is lost. Although it would be possible to then wait for a little while before accepting new locks (i.e. blindly timing out old locks), but offering new locks too quickly could create weird bugs and some of our locks last a very long time (hours, for some of our backends.) Really, without any information about those existing locks we can't even cancel them properly.

This is why we want three nodes to handle our lock mechanism.

Note: the number of leaders is limited to 3 (it can be 1 or 2 in a cluster of just 1 or 2 computers.) There are simplifications with having just 3 leaders which is why we chose that number at the moment. Later we will add a stronger failsafe by having leaders share the current locks with other non-leader snaplock daemons.

Initialization

First snaplock starts and initializes various connections. Especially, it connects to snapcommunicator so it can learn about the network. However, other services cannot start making use of it until the cluster is up and leaders were elected. Note that in most cases leaders get elected as soon as the cluster of snaplock is considered up.

In the initialization step, the snaplock service generates a random identifier to be used in the election process. This allows for varying nodes to end up being leaders instead of always the exact same (assuming you have more than three electable nodes, of course.) There is also a priority which gives certain nodes a higher chance of becoming a leader. That priority can also be used to mark certain computers as not interested, meaning that they will never participate in an election.

States

The snaplock daemons have a status. It can be queried by using the LOCKSTATUS message. The reply is either:

NOLOCK

This means that the snaplock service is not currently offering locks. It is not yet ready. It may be because it needs to have confirmation from its peers that they are up and running or there is a problem and the elections can't be conducted (i.e. you marked too many computers as being Off—not a candidate in the election.)

LOCKREADY

If you receive this reply to your LOCKSTATUS request, then the lock system is ready and a LOCK will quickly get a reply with a LOCKED message (or an error).

Note that LOCK messages sent to snaplock before the LOCKREADY status are cached by snaplock until they timeout. So you do not have to listen to the NOLOCK and LOCKREADY statuses, although many daemons do so they themselves can be marked as being ready or not (i.e. the snapdbproxy damone sends you NOCASSANDRA as a reply to its CASSANDRASTATUS as long as the snaplock says NOLOCK.)

The snaplock is in the NOLOCK state until valid elections happen and it enters the LOCKREADY state. If you shutdown many nodes all at once, then the state may return to NOLOCK. On a large enough live system, though, once the state is LOCKREADY, it should remain such forever except on computers where you are ready to do some maintenance (i.e. rebooting, upgrading, etc.)

Election Algorithm

I have a great book called Distributed Systems: An Algorithm Approach (pic. on right.) There are a few pages about Election Algorithms. How to determine which node should be the leader. However, there are two drawbacks with those algorithms:

The cluster needs to be complete to proceed with the election, problematic if you want to allow for a fully dynamic cluster (possibly one which creates new nodes and destroys old nodes, but never just reboots a node!)
The algorithms only elect one leader and we definitely want at least three (right now we are limited to three because it simplifies our implementation of the message quorum necessary to obtain locks.)

So we based our algorithm on the one presented in Sukumar's book, but we changed a couple of things to allow for the algorithm to work properly in a non-complete cluster and to elect three leaders instead of just one.

Messages

The algorithm I wrote is still fairly simple. There are three important messages:

CLUSTERUP

The CLUSTERUP message is sent by the snapcommunicator daemon once it knows of a little over 50% of the nodes in the cluster. The number of nodes needed to consider the cluster as being up is called the quorum. It is calculated as:

         n
quorum = - + 1
         2

where n is the total number of computers. Once your cluster is properly installed, we know n since we have the complete list of all the neighbors. That number will fluctuate as you add and remove nodes, although once you reach 5 nodes, a fluctuation of +1 or -1 can easily be ignored.

Since snapcommunicator only sends the CLUSTERUP message when it goes from being down to being up, snaplock sends a CLUSTERSTATUS to make sure it will receive the message at least once.

Note that snaplock ignores the CLUSTERCOMPLETE message that snapcommunicator sends too. Knowing that the cluster is complete is not useful to ensure proper mutual exclusion within the quorum of connected nodes. We still obtain a cluster complete state with 3 or less nodes because we have a minimum number of nodes required to start the elections.

LOCKSTARTED

The LOCKSTARTED message is used to connect all the snaplock daemons between each others. When a snaplock daemon starts, it broadcasts that message as soon as it receives the CLUSTERUP message from snapcommunicator. It also sends that message whenever a new computer is added to the cluster.

At first, a node doesn't know about any other node so it broadcasts the LOCKSTARTED to all the others. Whenever a node receives the LOCKSTARTED, it replies with a LOCKSTARTED directly to the sender if it did not yet know that sender. It does not reply to said sender if it already knows the sender. That way we avoid infinite loops.

The current implementation waits for a quorum of snaplock instances to be known before to proceed with the leaders elections. The quorum is calculated the same way as for the CLUSTERUP message above. Note that our algorithm would work within a quorum of nodes without the need for all the snaplock to know each others, but that require a more complicated algorithm with many more messages to ensure the election was run properly.

Once the quorum is reached, the election happens (see below for details about that.)

LOCKLEADERS

The LOCKLEADERS message is sent once the election results were determined.

The snaplock daemon with the smallest IP address found in the quorum of snaplock daemons proceeds with the election and determines the three leaders.

This message is considered authoritative. In other words, it requests that all the previously defined leaders to be overwriten with this new list of leaders.

Each snaplock still takes the election date in account (i.e. the most recent election results are the ones kept.) This assumes our network has clocks that are properly synchronized (we use ntp daemons.)

Restrictions

The election happens if and only if all of the following tests are true:

Cluster Up --- until the cluster is up, snaplock has no clue to the total number of nodes on the network, so it can't calculate a quorum and thus can't know whether to proceed with the elections
List of Leaders Incomplete --- if the daemon already has leaders, do not attempt another election; however, if a leader was lost, re-process the election to replace the lost leader
Quorum Reached --- we have two possibilities here, if the total number of nodes is 3 or less, then we do not want a quorum, instead we wait for a complete network; if there are 4 or more nodes, then we wait for at least quorum snaplock daemons to know of each others before proceeding
Smallest IP Address --- the node in charge of the election is the one with the smallest IP address; we do that because at the time CLUSTERUP happens many of the nodes may be connected to many different other nodes making it difficult to coordinate otherwise; the time it takes to process an election is so short, though, that using the smallest IP address node is pretty safe
Three Candidates --- among the nodes that form a cluster we need to have at least three of them that can be candidates; some nodes can request to never be part of the elections (for example, you may want to avoid elected snaplock instances on your Cassandra nodes); so the cluster may be considered up, but you may not yet have three nodes that are not turned off in terms of candidacy

Once all of these are true, the node with the smallest IP address at the time the cluster is formed elects three leaders and broadcast the results with the LOCKLEADERS message. At this point, further LOCKSTARTED messages include the leaders so new nodes coming up will automatically be told of the results and they do not take the initiative to handle a new election.

It's at this point that we enter the state LOCKREADY.

Sorting snaplock Daemons to Find the Leaders

The way the election works is by sorting the candidates by ID. The ID is formed by concatenation of the following five elements:

Priority --- as defined by the candidate_priority=... parameter found in the snaplock.conf configuration file; the special value "off" can be used to remove this node from the candidacy; the smaller the priority, the higher the chance for this node to get elected, however, which nodes become leaders very much depend on factors such as the order in which the nodes are booted up
Random Number --- a random number as generated in the snaplock initialization step; after that initialization step this random number never changes until snaplock is restarted; this is used to make sure that the snaplock can travel between the various nodes and not be forced to stick on a specific set of nodes; it is safe for multiple nodes to end up with the same random number
IP:port --- the IP address and port of the concerned node; the port is likely to always be 4040; the IP being unique between each nodes we can make use of it to make sure this identifier is unique; note that the participating node with the smallest IP:port wins the election if their Priority and Random Number are both equal, somehow
PID -- the process identifier of the snaplock instance generating this identifier; again this is used to make the identifier unique, but in this case that's for developers, on one system (i.e. while debugging we often run many instances of snaplock on a single system) each snaplock instance will have a different process identifier
Server Name --- whenever we get the list of leaders, we need to assign them to their corresponding node and that requires us having the name of the server on which these leaders run; it is also another unique identifier which makes sure that the sort when the elections happen is always clearly predictable; it is also important when receiving a list of leaders before we already registered one of those leaders through the LOCKREADY message

IMPORTANT NOTE: This algorithm does not currently work 100% correctly in a dynamic environment (i.e. a cluster that gows and shrinks.) The algorithm fails if the number of computers is very close to the quorum and new nodes get added at the time of an election. This is because it affects the quorum and it may cause problems. As long as new nodes get added when the network is CLUSTERCOMPLETE or when no elections are taking place, there should be no concerned.

Elected

The voting process was a success and the cluster knows of the three leaders.

Each leader changes its priority to 0. This way on the next elections they will always be the winners and if one of them is lost (a reboot happens or the node crashes) then the other two will remain leaders and one other node is elected as a replacement of the lost leader.

Note that the priority going to zero is virtual. It's done at the time the elections happen. The computer taking care of sorting the node's IDs makes sure to insert "00" as the priority of existing leaders.

Loss of Leaders

As soon as one of the three leaders is known to be lost, a rerun happens from the snaplock with the lowest IP address. Since the two leaders left are going to use a priority of "00", they will remain leaders (they will automatically be re-elected.) However, the order may still change so leader #2 may become leader #1, for example.

Nodes Hopping In

Up until the election happens, new nodes simply become potential candidates (unless their priority is "off," although such are still added to the list, but ignored in the election process itself.)

After the elections, new nodes are sent the results along the LOCKSTARTED replies. Nodes that a node receiving a LOCKSTARTED message with a list of leaders immediately enters the LOCKREADY state.

Obtaining a Lock

As far as the client is concerned

With three leaders the snaplock system is ready to accept LOCK requests. These include a URL and the time by which we want to obtain the lock and the duration of the lock, in case we do obtain it.

If the lock can't be obtained for whatever reason, then a LOCKFAILED message is sent to the lock requester.

If the lock is obtained, then the current date plus the lock duration are used to calculate the exact date when the lock will time out. That information is sent along the LOCKED message.

The client can send the UNLOCK message once done with the resource and it is expected to happen that way. However, if the UNLOCK does not arrive on time, the snaplock system sends an UNLOCKED message to the client which should react by releasing the resource as soon as possible and then send an UNLOCK message.

The snap_lock object in the snapwebsites library offers all of that functionality in an object one can use to obtain a lock. Very easy to use for the purpose. For locks used to do work that take a long time (like with backends), the get_timeout_date() can be used to know when the lock is going to timeout. The lock_timedout() function returns true once the lock has timed out. That second function actually reacts to the UNLOCKED message and tells you whether UNLOCKED was already received or not. Note that the lock stays in place until you react assuming you react quickly enough.

As far as the server is concerned

On the server side many things happen to know whether a lock is available or not.

First, if the LOCK message gets sent to a snaplock which is not a leader, that snaplock will proxy the LOCK message to one of the leaders. The first time, it selects a leader randomly. After that it uses a round robin mechanism so each leader will receive the LOCK message and has a chance to be the master.

When a leader receives a LOCK message, it becomes the master for that one LOCK request. If it dies before the lock is obtained, then another leader will take over and restart the process.

From there on, the lock process goes like this (which is an advanced implementation of the Lamport's Bakery Algorithm):

client: LOCK
proxy: forward LOCK (only if client does not talk directly to a leader)
master: LOCKENTERING
A/B: LOCKENTERED
master: GETMAXTICKET
A/B: MAXTICKET
master ADDTICKET
A/B: TICKETADDED
master: LOCKEXITING
A/B: ACTIVATELOCK
master: LOCKED
master: (on timeout) UNLOCKED error="timedout"

A/B are the other two leaders.

The master sends its message to A and B.

A and B reply to the master.

In all cases, when the master receives at least one reply that confirms that a message was received, it has a quorum and can consider that there is consensus. With such, it can move forward to the next step. When the master receives the second reply, it can be ignored.

At any stage, many tests are done to make sure that the algorithm is working as expected. If any error is detected, a LOCKFAILED happens with the "error" parameter likely set to "invalid".

If the messages take too long, somehow, then the process stops with a LOCKFAILED and the "error" parameter set to "timedout".

After you received the LOCKED, you should not receive a LOCKFAILED. Instead, you will get an UNLOCKED message. The UNLOCKED message may include an "error" parameter set to "timedout" in which case the lock will remain in place for at least as long as your lock was or 1 minute, whichever is the longest. That additional time must be used to stop using the resource that was locked or bugs may ensue as the previously locked data will be shared with others and if the resource needs to be edited by at most one process at a time, you're in big trouble!

Note that if the client closes its the connection to the snaplock daemon (i.e. the snaplock daemon receives an error when attempting to send a message to the client) then it is assumed that the client already decided to stop using the lock.

Download

The source code is available on github. Releases are likely to have a more stable version. Note that the snaplock requires the libsnapwebsites library and snapcommunicator at a minimum. These have additional dependencies of their own (some of which are in our contrib folder.)

We also have binary builds on Launchpad. It is part of the snapwebsites project.

Under Ubuntu, add the repository this way:

sudo add-apt-repository ppa:snapcpp/ppa
sudo apt-get update

Then install one of the library packages with the install command:

sudo apt-get install snaplock
sudo apt-get install snaplock-doc

Support

Please contact us on github by writing an issue.

Requirements

The tool requires the snapwebsites environment and various contribs.

Long term, we are thinking of creating a snapcommunicator contrib (i.e. our network communication layer,) which will allow us to have snaplock as a contrib. Very much reducing the snapwebsites project as a result and especially allowing other projects to easily reuse snaplock in their own environment.

License

This tool is licensed under the GPL v2+.

Snap! A C++ Open Source CMS