Discussion: One Table versus Many Tables

Mon
09/05/11

Cassandra versus SQL

Each block of data that we manage in Cassandra is the equivalent of an SQL table.

For example, we could create a simple table to manage pages, such as this one:

page_identifier
title
body

This table includes a page_identifier that gives you a way to reference a page from another table. Now suppose I write another module that extends a page and gives it a teaser capability. What directly comes to mind (at least to me) is adding the teaser field to the table:

page_identifier
title
body
teaser

This works well because reading a page still takes a single lookup, so the search time barely increases. The table now includes all the information we need.

However, when most programmers create a new plug-in, they also create a new table that is handled by that plug-in (especially in SQL.) This means you'd get the teaser in a separate table like this:

page_identifier
teaser

This is fine at first... Now imagine that you add 100 plug-ins and that each one of them adds a few fields that extend the page. At some point you end up with 101 tables all referencing the page_identifier in a one-to-one scheme. In other words, these 100 plug-ins could all have defined their extra fields in the main page table, which would make reading the data faster, possibly by a good tenfold.
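To make the difference concrete, here is a minimal sketch that models each table as an in-memory dictionary keyed by page_identifier. It is not Cassandra or SQL code; the plug-in table names and field names (plugin_0, field_0, ...) are made up for illustration, and it only counts table lookups rather than measuring time.

```python
# Hypothetical in-memory model: each "table" is a dict keyed by page_identifier.
# Table and field names are illustrative assumptions, not a real schema.
PLUGIN_COUNT = 100  # number of plug-ins extending the page, as in the text


def build_separate_tables(page_id):
    """One main table plus one table per plug-in, all keyed by page_id."""
    tables = {"page": {page_id: {"title": "Home", "body": "Welcome"}}}
    for n in range(PLUGIN_COUNT):
        tables[f"plugin_{n}"] = {page_id: {f"field_{n}": n}}
    return tables


def build_single_table(page_id):
    """One table whose single row holds every field of the page."""
    row = {"title": "Home", "body": "Welcome"}
    for n in range(PLUGIN_COUNT):
        row[f"field_{n}"] = n
    return {"page": {page_id: row}}


def read_page(tables, page_id):
    """Gather all the fields of a page; return (fields, table lookups)."""
    fields = {}
    lookups = 0
    for table in tables.values():
        lookups += 1  # one lookup per table that may hold data for the page
        fields.update(table.get(page_id, {}))
    return fields, lookups


separate_fields, separate_lookups = read_page(
    build_separate_tables("page-1"), "page-1")
single_fields, single_lookups = read_page(
    build_single_table("page-1"), "page-1")
```

Both layouts return the same fields, but the first one needs a lookup in each of the 101 tables while the second needs only one. How much time that saves on a real cluster depends on the database and the network.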

With Cassandra and the mechanism we chose here, we can very easily add fields to any existing table by adding a namespace to the field names. So instead of plain variable names, we want to use qualified names for all of our variables as in:

content::title
content::body
teaser::teaser
blog::sticky
links::content::parent
...

This can go on as much as we want, and names cannot clash as long as each plug-in uses its own namespace. (This is the same idea as namespaces in C++.)
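As a rough illustration of the qualified names listed above, the following sketch stores all the fields of one page in a single row (a plain dictionary) and retrieves the fields that belong to one plug-in. The helper names qualified() and fields_of() and the sample values are assumptions made for this example.

```python
def qualified(namespace, name):
    """Build a qualified field name such as 'content::title'."""
    return f"{namespace}::{name}"


def fields_of(row, namespace):
    """Return only the fields of a row that belong to the given namespace."""
    prefix = namespace + "::"
    return {key: value for key, value in row.items() if key.startswith(prefix)}


# One row holds the fields of every plug-in for a given page.
row = {
    qualified("content", "title"): "My Page",
    qualified("content", "body"): "Some text...",
    qualified("teaser", "teaser"): "Some...",
    qualified("blog", "sticky"): True,
    # The name part may itself be qualified, as in links::content::parent.
    qualified("links", "content::parent"): "parent-page",
}

content_fields = fields_of(row, "content")
```

Here content_fields holds only content::title and content::body; links::content::parent is left out because its outermost namespace is links.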

Content and Data Tables

When we added support for branches and languages, the data had to have multiple entries for each page. In order to handle that without having to change the field names each time, we instead decided to change the row name.

In effect we decided to have two tables, although later we will want three (because we can optimize the usage of the revision table by telling Cassandra that rows are written once and then pretty much become read-only.)

The idea of having multiple tables was, at first, mainly to make sure the content table remains as small as possible (i.e. exactly one row per page.)

Content Table -- all the pages, includes the necessary references to know which branch and revision are current, which are the last, and which to edit.

Branch Table -- all the branch information for each page (data that applies to all the revisions of a branch, such as links between pages).

Revision Table -- the content of the pages, the title, body, teaser, etc.; in case of an editor form, any number of fields may appear in a revision.

Note: the Branch and Revision tables both use the data table at this point. We'll split them as soon as possible...
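The three-table layout described above could be sketched as follows, again with dictionaries standing in for the tables. The row key formats (page#branch and page#language/branch.revision) and the two reference fields in the content row are assumptions chosen for this example; the text above does not specify them. For simplicity the sketch keeps a single current revision per page instead of one per language.

```python
# Hypothetical sketch: three dicts stand in for the three tables.
content_table = {}   # exactly one row per page
branch_table = {}    # one row per (page, branch)
revision_table = {}  # one row per (page, language, branch, revision)


def branch_key(page, branch):
    """Assumed row name for a branch, e.g. 'about#1'."""
    return f"{page}#{branch}"


def revision_key(page, language, branch, revision):
    """Assumed row name for a revision, e.g. 'about#en/1.2'."""
    return f"{page}#{language}/{branch}.{revision}"


def save_revision(page, language, branch, revision, fields, branch_fields=None):
    """Write a revision row once and mark it as current in the content row."""
    bkey = branch_key(page, branch)
    rkey = revision_key(page, language, branch, revision)
    branch_table.setdefault(bkey, {}).update(branch_fields or {})
    revision_table[rkey] = dict(fields)  # field names stay the same
    content_table.setdefault(page, {}).update({
        "content::current_branch_key": bkey,      # hypothetical field name
        "content::current_revision_key": rkey,    # hypothetical field name
    })
    return rkey


def load_current(page):
    """Follow the references in the content row to the current data."""
    row = content_table[page]
    data = dict(branch_table[row["content::current_branch_key"]])
    data.update(revision_table[row["content::current_revision_key"]])
    return data


save_revision("about", "en", 1, 1, {"content::title": "About"},
              {"links::content::parent": "home"})
save_revision("about", "en", 1, 2, {"content::title": "About Us"})
current = load_current("about")
```

After the two saves, the content table still has a single row for the page, the revision table has two rows, and load_current() returns the title of the second revision together with the branch-level link. The field names never change; only the row name does.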