Lists

Sun
09/04/11

What are lists?

Note: the feature is defined here: List feature [core]

Implementation: Indexes

In general, lists are either hard coded or defined by end users through a mechanism that accepts some sort of query. Since we're not using SQL, the query has to be generated manually. The core system will include all the tools necessary to quickly offer list capabilities for any data that the core and 3rd party plugins offer.

Note that the proper term for a list is an index: lists are the equivalent of SQL database indexes.

Creating and Maintaining Lists

Unwanted Optimization

Lists that are limited in the number of entries they offer could use an index with only that many items [1].

For example, the list of the last 5 comments would not need to gather more than 5 comments. All the other comments could be discarded from the index, which makes for a very small index that is faster to handle. However, the truth is that we nearly always need a complete list of all the data.

In the case of comments, we want a complete list of all the comments anyway. This is necessary to handle different outputs such as a comment RSS feed, or an administrative list of all the comments used to moderate them as required. To avoid a lot of duplication, the best method is to keep one list of all those items. To show the last 5 comments, just read 5 items from that list! (This being said, we may want to have several lists of comments: published, unpublished, marked as spam, trashed, etc.)
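As a minimal sketch of this idea, the following Python snippet reads a bounded number of entries from one complete list. The names `published_comments` and `latest` are hypothetical, and a plain in-memory list stands in for the real index, which is assumed to be ordered with the most recent entry first.

```python
from itertools import islice

# Hypothetical stand-in for the complete index of published comments,
# ordered from most recent to oldest.
published_comments = ["c9", "c8", "c7", "c6", "c5", "c4", "c3"]


def latest(index, count):
    """Return at most `count` entries from the start of a complete index."""
    return list(islice(index, count))


# The "last 5 comments" box and the full feed share the same index.
last_five = latest(published_comments, 5)
all_comments = latest(published_comments, len(published_comments))
```

The same complete index serves both the short box and the full administrative list, so no separate 5-entry index has to be maintained.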

Continuous Maintenance

Lists are to be continuously maintained since they change as soon as new content is published or old content is updated or deleted. For example, if someone posts a new comment, that new comment is definitely more recent than the most recent comment currently found in the index of the most recent comments, so we add the new comment at the beginning of the list. Similarly, a comment moved to the trash gets removed from the list of published comments and added to the list of trashed comments. This way the handling of the indexes is and remains very fast: doing it continuously means doing very little work each time, which is much faster than creating a whole index at the time the data is needed.
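The incremental updates described here could look roughly like the sketch below. The class `CommentLists`, its method names, and the choice to place a moved comment at the beginning of the target list are all assumptions made for illustration; in-memory deques stand in for the stored indexes.

```python
from collections import deque


class CommentLists:
    """Hypothetical set of comment lists kept current on every change."""

    def __init__(self):
        # One list per state; the newest entry sits at the left end.
        self.lists = {"published": deque(), "trashed": deque()}

    def post(self, comment_id):
        # A new comment is newer than everything already listed,
        # so it goes at the beginning of the list.
        self.lists["published"].appendleft(comment_id)

    def move(self, comment_id, source, target):
        # Remove from one list and add to the other in a single step.
        # deque.remove raises ValueError if the comment is not in `source`.
        self.lists[source].remove(comment_id)
        self.lists[target].appendleft(comment_id)
```

For example, after `post("c1")` and `post("c2")`, calling `move("c1", "published", "trashed")` leaves only `c2` in the published list. Each change touches just the affected entries, which is what keeps the maintenance cheap.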

New Lists, Modified Lists

Adding data to and removing data from an existing list is very fast to do on the fly and very useful since the lists are always current. This is very much the way an SQL database manages data. Whenever you add a row to a table, the database also updates all of its indexes to place that row in the correct location and thus make it findable very quickly when queried on the indexed data.

However, this does not work when the query used to generate the list has just been created or later gets modified. Modifying the data only moves that data within the lists where it appears, but modifying the search criteria means that different data is now to appear in the list. That's a much more involved change.

In those cases we want to reset the index (when an existing index is modified) and check all the referenced data available again. In our case, we may just create a new index each time since the result is the same. This means modifying a list is equivalent to deleting the old list and creating a new list, although the old index could remain available until the generation of the new index is complete.
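A rough sketch of this "build a new index, then replace the old one" approach follows. The names `rebuild_index` and `ListRegistry` are hypothetical, pages are modeled as a dictionary of row key to page data, and the criteria are modeled as a Python predicate; none of these come from the actual implementation.

```python
def rebuild_index(pages, criteria):
    """Scan every page and keep the row keys that match the criteria."""
    return [key for key, page in pages.items() if criteria(page)]


class ListRegistry:
    """Hypothetical registry mapping a list name to its current index."""

    def __init__(self):
        self.indexes = {}

    def replace(self, name, pages, criteria):
        # The old index stays in self.indexes, and so stays readable,
        # while the new one is being generated.
        new_index = rebuild_index(pages, criteria)
        # Swap only once the new index is complete.
        self.indexes[name] = new_index
```

Because the swap is a single assignment at the end, readers see either the old index or the complete new one, never a partially built list.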

Note that creating new lists is done on a backend server to avoid overloading front end servers.

Implementations of Lists

With Cassandra, we create our lists using columns. Any one page can include one or more lists. For example, a page with tags, similar posts, boxes on the sides... includes a list of tags, a list of posts with similar tags or keywords, and a list of boxes (references to other pages of content displayed as teaser or box as may be appropriate.)

The list of all the comments is actually a page which gathers row keys of all the pages representing a comment. Some such pages may actually be hidden. For example, a site that does NOT promote social networking of its users probably wants to keep its list of users hidden.

A page with comments has a list of its comments, and each one of those comments has a link back to that page. This is called dual linked lists, although the term dual may be somewhat confusing in this situation since we have many links in each page. For example, a page that represents a comment can have a link to a user representing the author, a link to the user who moderated the comment, a link to the page where the comment appears, a link to the list of all the comments, etc. However, for each one of those links there is a link back to the comment. This is how any change can quickly be rippled through the entire system.
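To illustrate the link and link-back pairs, here is a minimal in-memory sketch. The `Links` class, its method names, and the link names such as `"author"` are hypothetical; the real system stores links in Cassandra, which is not modeled here.

```python
from collections import defaultdict


class Links:
    """Hypothetical store where every link also records a link back."""

    def __init__(self):
        self.forward = defaultdict(set)   # (source, name) -> targets
        self.backward = defaultdict(set)  # target -> {(source, name)}

    def link(self, source, name, target):
        # Always write both directions together.
        self.forward[(source, name)].add(target)
        self.backward[target].add((source, name))

    def unlink(self, source, name, target):
        # discard() is a no-op when the link does not exist.
        self.forward[(source, name)].discard(target)
        self.backward[target].discard((source, name))

    def referrers(self, target):
        """Return every (source, name) pair that points at `target`."""
        return sorted(self.backward.get(target, ()))
```

With `link("comment/1", "author", "user/7")` and `link("comment/1", "page", "page/3")` recorded, `referrers("user/7")` finds the comment without scanning all pages. That reverse lookup is what lets a change to one page reach everything that refers to it.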

Note that Cassandra limits a row to 2 billion (2^31 - 1) columns, so a list stored in the columns of one row is limited to about 2 billion entries.

Redundancy Considerations

When a new list is created, we should detect whether an existing list already uses the exact same criteria so that we can avoid creating two lists with the exact same data. Instead, we can increase a counter on the existing list so that it has to be deleted twice before it is actually removed.
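The counter can be sketched as simple reference counting. The `SharedLists` class and its `acquire` and `release` methods are hypothetical names, and the sketch assumes the criteria have already been normalized into a hashable key (for example a tuple of conditions) so that identical criteria compare equal.

```python
class SharedLists:
    """Hypothetical registry that shares one index per distinct criteria."""

    def __init__(self):
        self.entries = {}  # criteria key -> [reference count, index]

    def acquire(self, criteria_key, build):
        """Return the index for these criteria, building it only once."""
        entry = self.entries.get(criteria_key)
        if entry is None:
            entry = self.entries[criteria_key] = [0, build()]
        entry[0] += 1
        return entry[1]

    def release(self, criteria_key):
        """Drop one reference; return True if the index was removed."""
        entry = self.entries[criteria_key]
        entry[0] -= 1
        if entry[0] == 0:
            del self.entries[criteria_key]
            return True
        return False
```

If two plugins both call `acquire` with the same key, the index is built once, and it is only removed after both have called `release`.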

These circumstances should not occur within the Core system, but plugins could add lists, and two distinct plugins that do not require one another could end up creating the exact same index. For example, a Feed plugin could create an RSS feed of the comments and need the list of published comments. Similarly, the Digest plugin could be used to aggregate the set of all the latest comments and send them to registered users once a week. In both cases, the plugins need the list of comments. (In a way this is a bad example since that list would be available in the Core and thus would not need to be replicated, but similar situations can happen with more complex queries or data that Core does not control.)

1. We may want to limit the list to a few more entries than required in case some get deleted. Also, we may just want to avoid limited indexes altogether so that we can share them more easily: Index A may be able to use Index B, but only if B is applied to all the corresponding data, and that is actually the main case.