Unique names for our rows

Sun
10/16/11

Table names

Tables are well defined so they get very specific names and do not need any special handling.

Column names

In most cases column names are like C++ member variable names: they are well defined. These names are expected to be defined with a namespace (e.g. the names of the page data start with page::).

Once in a while column names are actually row names. This happens when a table is used as an index. In this case the column name is a copy of a row name, so defining these names is not complicated.
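
As a rough illustration of the two kinds of column names, here is a minimal Python sketch that uses plain dictionaries in place of real tables. The "page::title" column, the "pages" index row, and the add_to_index helper are hypothetical names chosen for the example; they do not come from the actual schema.

```python
# Hypothetical in-memory stand-ins for tables: {row name: {column name: value}}.
content = {
    "http://example.com/journal/today": {
        "page::title": "Today",  # well-defined, namespaced column name
    },
}

index = {}

def add_to_index(index_table, index_row, row_name, value=b""):
    """Record row_name as a column name inside the given index row."""
    index_table.setdefault(index_row, {})[row_name] = value

# In the index table, each column name is a copy of a content row name.
for row_name in content:
    add_to_index(index, "pages", row_name)

indexed_names = list(index["pages"])
```

Listing the columns of the "pages" row then gives back the names of the indexed content rows.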

Row names

So... row names are complicated.

There are several kinds of row names we want to use.

Content

First of all, there is the content we create for a website. In that case, we use the full URL of the content after canonicalization. That works just fine since each URL has to be unique anyway.
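
The exact canonicalization rules are not spelled out here, so the following Python sketch only shows the general idea. Every rule in it (lowercase scheme and host, dropping the default port, collapsing repeated slashes, removing the trailing slash, discarding the query string and fragment) is an assumption made for the example, and content_row_key is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def content_row_key(url):
    """Return an assumed canonical form of url, usable as a content row name."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default one for the scheme.
    if parts.port is None or parts.port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{parts.port}"
    # Collapse empty segments; this also removes a trailing slash.
    segments = [segment for segment in parts.path.split("/") if segment]
    path = "/" + "/".join(segments)
    # Assumption: the query string and fragment are not part of the row name.
    return urlunsplit((scheme, netloc, path, "", ""))

key = content_row_key("HTTP://Example.COM:80/Journal//today/")
```

With these assumed rules, two spellings of the same address map to the same row name, which is what makes the URL usable as a unique key.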

Data

The data table is similar to the content, but it has to support distinct branch entries and revision entries. Revision entries also include the name of the locale.

Branch: <uri>/<path>#<branch>

Revision: <uri>/<path>#<language>_<country>/<branch>.<revision>

In the revision, the locale as a whole or just the country part is optional. So the following two row keys are used too:

Revision: <uri>/<path>#<language>/<branch>.<revision>

Revision: <uri>/<path>#<branch>.<revision>
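
The branch and revision key formats above can be sketched as two small Python functions. The function names and the sample values (a site URI of http://example.com, numeric branch and revision) are assumptions for illustration only.

```python
def branch_key(uri, path, branch):
    """Build "<uri>/<path>#<branch>"."""
    return f"{uri}/{path}#{branch}"

def revision_key(uri, path, branch, revision, language=None, country=None):
    """Build a revision row key; the locale or just its country may be omitted."""
    if country and not language:
        raise ValueError("a country requires a language")
    if language and country:
        locale = f"{language}_{country}/"
    elif language:
        locale = f"{language}/"
    else:
        locale = ""
    return f"{uri}/{path}#{locale}{branch}.{revision}"

site = "http://example.com"
full = revision_key(site, "journal/today", 1, 3, "en", "US")
language_only = revision_key(site, "journal/today", 1, 3, "en")
no_locale = revision_key(site, "journal/today", 1, 3)
```

The three calls correspond to the three revision forms listed above: full locale, language only, and no locale.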

It is also possible to include drafts (from authors) and suggestions (from visitors). Those are linked to a revision, but they do not specify a revision in the row key. Also the language and country are not mandatory.

Revision: <uri>/<path>#user/<identifier>/<language>_<country>/<branch>

Revision: <uri>/<path>#suggestion/<language>_<country>/<branch>
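
Draft and suggestion keys can be built the same way. In this sketch the function names are hypothetical, and the form used when the language or country is missing (the locale segment is shortened or dropped, as for revisions) is an assumption, since only the full form is shown above.

```python
def locale_prefix(language=None, country=None):
    """Return "<language>_<country>/", "<language>/", or "" without a locale."""
    if country and not language:
        raise ValueError("a country requires a language")
    if not language:
        return ""
    return f"{language}_{country}/" if country else f"{language}/"

def draft_key(uri, path, identifier, branch, language=None, country=None):
    """Build "<uri>/<path>#user/<identifier>/<language>_<country>/<branch>"."""
    locale = locale_prefix(language, country)
    return f"{uri}/{path}#user/{identifier}/{locale}{branch}"

def suggestion_key(uri, path, branch, language=None, country=None):
    """Build "<uri>/<path>#suggestion/<language>_<country>/<branch>"."""
    locale = locale_prefix(language, country)
    return f"{uri}/{path}#suggestion/{locale}{branch}"

site = "http://example.com"
draft = draft_key(site, "journal/today", "42", 1, "en", "US")
suggestion = suggestion_key(site, "journal/today", 1, "en", "US")
```

Neither key carries a revision number, which matches the statement that drafts and suggestions do not specify a revision in the row key.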

Users

Other rows are for users. Similarly, users are just managed like all other content, so their row names are unique names based on a full URL. In that case, though, we could define users without the domain name to make them available to all the existing websites.

Other tables

Whenever a table requires a unique identifier and no unique entry such as a URL is available to define one, we want to be able to use a counter that works every time.

This is achieved using the hostname of the computer and a counter table. The counter table has one counter per host and each counter is a 64-bit value that is increased by one every time it is read. Since each counter is specific to a host, the counting can be locked by the host (i.e. there is no need for an inter-node lock; a very basic file locking mechanism is sufficient).

The result is still unique, although it is a string of the form: "<hostname>/<counter number>".

Note: this technique does not require the use of Cassandra Counters. However, it should not be used for counters such as statistics counters, for which Cassandra Counters are much better suited.
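
The per-host counter can be sketched in Python as follows. This is only an illustration under several assumptions: a dictionary stands in for the counter table, the lock file path and the next_unique_id name are hypothetical, and fcntl.flock (available on Unix-like systems only) plays the role of the basic file lock. Because the dictionary lives in one process, the lock here only demonstrates where the locking would happen.

```python
import fcntl
import os
import socket
import tempfile

# Hypothetical stand-in for the counter table: one 64-bit counter per host.
counter_table = {}

def next_unique_id(lock_path, hostname=None, table=counter_table):
    """Return "<hostname>/<counter number>" after increasing the host counter."""
    hostname = hostname or socket.gethostname()
    # A host-local file lock is enough: only this host uses this counter.
    with open(lock_path, "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Read the counter and increase it by one, kept within 64 bits.
            value = (table.get(hostname, 0) + 1) & 0xFFFFFFFFFFFFFFFF
            table[hostname] = value
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return f"{hostname}/{value}"

lock_path = os.path.join(tempfile.gettempdir(), "row_counter.lock")
first = next_unique_id(lock_path, hostname="web1")
second = next_unique_id(lock_path, hostname="web1")
other_host = next_unique_id(lock_path, hostname="web2")
```

Two hosts can reach the same counter number, but the hostname prefix keeps the resulting strings distinct.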