snaplock

Introduction

The snaplock project is part of the snapwebsites environment. It allows for locking various resources across the entire cluster. A resource has a URL, which is what we use to create the lock. The URL can be global to the whole cluster, specific to a website, or even specific to one special variable in a specific page of a specific website (very fine granularity is possible.) At this time only exclusive locks are available. It is planned to add a semaphore capability which will offer a Read/Write (exclusive) or Read-Only (shared) lock mechanism.

The snaplock tool runs on any system that may require a lock. More or less, all the nodes in a Snap Cluster.

The locking is obtained using Lamport's Bakery Algorithm. However, we first make sure that the cluster is considered up before we allow locks (i.e. with too few nodes on a Snap! cluster, snaplock can't guarantee that a lock would actually lock anything properly.)
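
For reference, the following is a minimal sketch of the classic shared-memory formulation of Lamport's Bakery Algorithm. It only illustrates the "take a ticket, smallest ticket goes first" idea; snaplock implements a distributed, message-based variant of it, and none of the names below come from the snaplock source.

    #include <algorithm>
    #include <array>
    #include <atomic>
    #include <cstddef>

    // Classic shared-memory bakery algorithm for N threads; snaplock applies
    // the same "take a ticket, smallest ticket goes first" idea between nodes
    // using messages instead of shared memory.
    constexpr std::size_t N = 4;

    std::array<std::atomic<bool>, N> g_entering{};  // thread i is picking a ticket
    std::array<std::atomic<int>, N>  g_ticket{};    // 0 means "not interested"

    void bakery_lock(std::size_t i)
    {
        g_entering[i] = true;
        int max(0);
        for(std::size_t j(0); j < N; ++j)
        {
            max = std::max(max, g_ticket[j].load());
        }
        g_ticket[i] = max + 1;          // take the next ticket number
        g_entering[i] = false;

        for(std::size_t j(0); j < N; ++j)
        {
            // wait for j to finish choosing its ticket
            while(g_entering[j]) {}

            // wait while j holds a smaller ticket (ties broken by index)
            while(g_ticket[j] != 0
               && (g_ticket[j] < g_ticket[i]
                   || (g_ticket[j] == g_ticket[i] && j < i))) {}
        }
    }

    void bakery_unlock(std::size_t i)
    {
        g_ticket[i] = 0;                // release our ticket
    }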

Further, snaplock makes sure that the locking is fast by using at most three nodes to perform the actual lock (Lamport's Bakery Algorithm). Many lock systems use just one node, often called a Leader, but this can cause one big problem: it is not failsafe in case of a crash, i.e. on a crash, the information about all the existing locks is lost. It would be possible to then wait for a little while before accepting new locks (i.e. timing out old locks), but offering new locks too quickly could create weird bugs too, and some of our locks last a very long time (hours.) Really, without any information about those existing locks we can't even cancel them properly.

This is why we want three nodes to handle our lock mechanism.

Initialization

First snaplock starts and initializes various connections. In particular, it connects to snapcommunicator so it can learn about the network. However, other services cannot start making use of snaplock until the cluster is up and leaders have been elected. Note that in most cases leaders get elected as soon as the cluster is considered up.

In the initialization step, the snaplock service generates a random identifier to be used by the election process.
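
As a rough sketch of that step (the function name and the value range are assumptions made for illustration, not the actual snaplock code):

    #include <cstdint>
    #include <random>

    // Generate the random number that becomes part of this node's election
    // identifier; it is generated once at startup and never changes afterward.
    std::int64_t generate_election_random_number()
    {
        std::random_device rd;                      // non-deterministic seed
        std::mt19937_64 gen(rd());
        std::uniform_int_distribution<std::int64_t> dist(0, 65535);
        return dist(gen);
    }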

States

We have two states in regard to the election:

  1. Unknown Leaders
  2. Known Leaders

The snaplock daemon is in the first state up until the election happens. At that point the system is in the Known Leaders state. If you shut down many nodes all at once, then the state may return to Unknown Leaders. On a live system, though, once the state is Known Leaders, it should remain that way forever.
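
In code form, the state boils down to something like the following sketch (the names are illustrative, not the actual snaplock symbols):

    // The two election-related states a snaplock daemon can be in.
    enum class leader_state
    {
        unknown_leaders,    // before an election (or after losing the leaders)
        known_leaders       // once a LOCKLEADERS message defined the leaders
    };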

Election Algorithm

The algorithm used is fairly simple. There are three important messages:

  1. CLUSTERUP --- once snapcommunicator considers that the cluster is up, it sends this message to all the existing connections
  2. LOCKREADY --- all the snaplock daemons send this message to each other to get a list of all the available nodes on the network
  3. LOCKLEADERS --- once the election is over snaplock broadcasts the new leaders using the LOCKLEADERS message, which is authoritative (compared to the list of leaders in the LOCKREADY message)

The CLUSTERUP message is sent by the snapcommunicator daemon once it knows of a little over 50% of the nodes on the cluster. The number of nodes needed to consider the cluster as being up is the quorum. It is calculated as:

         n
quorum = - + 1
         2
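
In integer arithmetic this is simply the following (a sketch; the function name is an assumption and integer division is assumed for the n / 2 term):

    // Number of nodes that must be known before the cluster is considered up;
    // with integer division, for n = 5 the quorum is 3 and for n = 10 it is 6.
    int cluster_quorum(int n)
    {
        return n / 2 + 1;
    }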

The LOCKREADY message is used to connect all the snaplock daemons to each other. It is first broadcast when the snaplock instance receives the CLUSTERUP message. This happens either after a CLUSTERSTATUS request or when snapcommunicator actually detects that the cluster is indeed up.

At first, a node doesn't know about any other node, so it broadcasts the LOCKREADY message to all the others. Whenever a node receives a LOCKREADY, it replies with a LOCKREADY directly to the sender if it did not yet know that sender. It does not reply if it already knows the sender.
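
The reply logic can be summarized with a sketch like the following (the container, the callback, and the send function are assumptions used for illustration):

    #include <set>
    #include <string>

    void send_lockready_to(std::string const & node);   // hypothetical send function

    std::set<std::string> g_known_nodes;                 // nodes we already heard from

    // Called whenever a LOCKREADY message is received; reply only to senders
    // we did not yet know about so the exchange eventually quiets down.
    void on_lockready(std::string const & sender)
    {
        if(g_known_nodes.insert(sender).second)
        {
            // first time we hear from this node, tell it about us
            send_lockready_to(sender);
        }
    }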

The LOCKLEADERS message is sent once one of the snaplock daemons proceeded with the election and found the three leaders. This message is authoritative, meaning that it requests for all the previously defined leaders to be overwritten with this new list, although snaplock still takes the election date into account (i.e. the newest election is the one kept.)

The election happens if and only if all of the following tests are true (see the sketch after the list):

  1. No Leaders --- if the daemon already has leaders, do not attempt another election
  2. Cluster Not Up --- until the cluster is up, snaplock has no clue how many nodes are on the network, so it can't calculate a quorum and thus can't know whether to proceed with the elections
  3. Quorum Reached --- we have two possibilities here: if the total number of nodes is 3 or less, then we do not use a quorum, instead we wait for a complete network; if there are 4 or more nodes, then we wait for at least a quorum of snaplock daemons to know of each other before proceeding
  4. Smallest IP Address --- the node in charge of the election is the one with the smallest IP address; we do that because at the time CLUSTERUP happens many of the nodes may be connected to many different other nodes, making it difficult to coordinate otherwise; the time it takes to process an election is so short, though, that using the node with the smallest IP address is pretty safe
  5. Three Candidates --- among the nodes that form the cluster we need at least three that can be candidates; some nodes can request to never be part of the elections (for example, you may want to avoid electing snaplock instances on Cassandra nodes); so the cluster may be considered up, but you may not yet have three nodes whose candidacy is not turned off
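
Put together, the decision to proceed could look like the following sketch (all names are illustrative and the real snaplock code also has to cope with the cluster size and the candidate list changing over time):

    #include <string>
    #include <vector>

    struct election_context
    {
        bool                        leaders_known = false;  // leaders already defined?
        bool                        cluster_up = false;     // CLUSTERUP received?
        int                         total_nodes = 0;        // size of the whole cluster
        std::vector<std::string>    known_snaplocks;        // snaplock daemons we know of
        std::vector<std::string>    candidates;             // nodes not marked "off"
        std::string                 my_ip;                  // this node's IP:port
        std::string                 smallest_known_ip;      // smallest IP:port seen so far
    };

    // Return true when this node is allowed to proceed with the election.
    bool can_run_election(election_context const & ctx)
    {
        int const quorum(ctx.total_nodes / 2 + 1);
        int const known(static_cast<int>(ctx.known_snaplocks.size()));

        return !ctx.leaders_known                               // 1. no leaders yet
            && ctx.cluster_up                                   // 2. cluster is up
            && (ctx.total_nodes <= 3                            // 3. quorum reached
                    ? known == ctx.total_nodes                  //    (small cluster: all nodes)
                    : known >= quorum)
            && ctx.my_ip == ctx.smallest_known_ip               // 4. smallest IP address
            && ctx.candidates.size() >= 3;                      // 5. three candidates
    }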

Once all of these are true, the node with the smallest IP address at the time the cluster is formed elects the three leaders and broadcasts the results with the LOCKLEADERS message. At this point, further LOCKREADY messages will include the leaders so new nodes coming up will automatically be told of the results.

It's at this point that we enter the Known Leaders state.

The way the election works is by sorting the candidates by ID. The ID is formed by concatenating the following five elements (a sketch of building such an ID follows the list):

  1. Priority --- as defined by the candidate_priority=... parameter of the snaplock.conf configuration file; the special value "off" can be used to remove this node from the candidacy; the smaller the priority, the higher the chance for this node to get elected; however, it will very much depend on the order in which the nodes are booted up and other similar factors
  2. Random Number --- a random number generated in the snaplock initialization step; after that initialization step this random number never changes; it is used to make sure that the leader role can travel between the various nodes and is not forced to stick to a fixed set of nodes
  3. IP:port --- the IP address and port of the concerned node; the port is likely to always be 4040; since the IP is unique between nodes we can use it to make sure this identifier is unique; also, the participating node with the smallest IP:port wins the election should the Priority and Random Number somehow both be equal
  4. PID --- the process identifier of the snaplock instance generating this identifier; again this is used to make the identifier unique, but in this case it is for developers working on a single system (i.e. while debugging we often run many instances of snaplock on one system), where each snaplock instance must still get a different identifier
  5. Server Name --- whenever we get the list of leaders, we need to assign them to their corresponding node, and that requires us to have the name of the server on which each leader runs; it is also another unique identifier which makes sure that the sort used by the elections is always clearly predictable; it is also important when receiving a list of leaders before we registered one of those leaders through the LOCKREADY message
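
A minimal sketch of building such an identifier is shown below; the separators, field widths, and exact formatting are assumptions made for illustration, only the ordering of the five elements matters for the sort:

    #include <cstdint>
    #include <iomanip>
    #include <sstream>
    #include <string>
    #include <sys/types.h>

    // Build the sortable election identifier from the five elements described
    // above; zero-padded numeric fields keep the lexical sort meaningful for
    // the priority and the random number, which are compared first.
    std::string build_election_id(
          int priority                      // candidate_priority, 0 to 15
        , std::int64_t random_number        // generated once at initialization
        , std::string const & ip_port       // e.g. "10.0.0.1:4040"
        , pid_t pid                         // process identifier of this snaplock
        , std::string const & server_name)  // name of the server we run on
    {
        std::ostringstream id;
        id << std::setfill('0')
           << std::setw(2)  << priority      << '|'
           << std::setw(8)  << random_number << '|'
           << ip_port                        << '|'
           << std::setw(10) << pid           << '|'
           << server_name;
        return id.str();
    }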

IMPORTANT NOTE: This algorithm does not currently work correctly in a dynamic environment (i.e. a cluster that grows and shrinks), although dynamically allocated nodes that are going to be removed at some point can at least be assigned a priority of 15 so they have no chance of becoming a leader, eliminating the need to handle any complicated process with them.

Elected

The voting process was a success and the cluster knows of the three leaders.

Each leader changes its priority to 0. This way, on the next election, they will always be the winners, and if one of them is lost (a reboot happens or the node crashes) then the other two will remain leaders and one other node is elected as a replacement for the lost leader.

Important Note: When a connection is lost and cuts off many snapcommunicators from each other all at once, the existing leaders may all be lost. This is likely to cause problems later if a re-election happens and new leaders are selected in between. If an ex-leader receives a message saying other nodes are leaders, it is expected to reset its priority back to its default. The leaders receive a CLUSTERDOWN message when that happens, which is when they do so.

Loss of Leader

When one of the three leaders is lost, a RERUN is requested. This message restarts the election process. Since the two remaining leaders have a priority of 0, they will remain leaders (they will automatically be re-elected.)

Nodes Hopping In

Up until the election happens, new nodes simply become potential candidates (unless their priority is "off"; such nodes are still added to the list, but are ignored by the election itself.)

After the elections, new nodes are sent the results along with the LOCKREADY replies. Nodes that receive a LOCKREADY message with a list of leaders immediately enter the Known Leaders state.
