Updating libQtCassandra (0.4.2)

Sat
08/18/12

I'm working on an update of libQtCassandra so that it works with Cassandra 1.1.0 (and Thrift 0.8.0). The update will also include additional tests and, hopefully, enhance the interface to support super-columns and columns with multiple names (a:b:c...). If time allows, I may even add counters.

Thrift

The Cassandra interface generated by Thrift 0.8.0 did not compile as is. I had to make many fixes to the generated output before g++ would compile it, and I had to change the libQtCassandra library too. It seems that Thrift removed a number of headers. It is also possible that the newer version assumes a more recent boost library than the one I have at this time. I will keep the interface generated with 0.7.0 in case you prefer to compile with the older version.

Cassandra: Replication Factor (KsDef)

The direct replication factor parameter was deprecated in Cassandra version 1.1 (see cassandra mail list message: [jira] [Commented] (CASSANDRA-4294) Upgrading encounters: 'SimpleStrategy requires a replication_factor strategy option.' and refuses to start.) It is now accepted as a strategy option only. In other words, you should use setDescriptionOption("replication_factor", "1") instead of setReplicationFactor(1). I changed the setReplicationFactor() function so that it automatically calls the setDescriptionOption() function for you. However, if you call setDescriptionOptions() after setReplicationFactor(), the replication factor value will be lost.
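The following is a minimal sketch of that forwarding behavior, not the libQtCassandra source. The class name KeyspaceSketch, its std::map storage, and the hasOption() and option() helpers are assumptions made for illustration; only the three setter names come from the text above. It shows why replacing the whole option map afterward drops the replication factor.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a keyspace definition; NOT the libQtCassandra class.
class KeyspaceSketch
{
public:
    typedef std::map<std::string, std::string> options_t;

    // Set a single strategy option and keep all the others.
    void setDescriptionOption(const std::string& name, const std::string& value)
    {
        f_options[name] = value;
    }

    // Replace the whole option map; anything set earlier is dropped.
    void setDescriptionOptions(const options_t& options)
    {
        f_options = options;
    }

    // Kept for backward compatibility: forwards to the option map.
    void setReplicationFactor(int factor)
    {
        assert(factor > 0);
        setDescriptionOption("replication_factor", std::to_string(factor));
    }

    // Illustration-only helpers used to inspect the result.
    bool hasOption(const std::string& name) const
    {
        return f_options.find(name) != f_options.end();
    }

    std::string option(const std::string& name) const
    {
        options_t::const_iterator it(f_options.find(name));
        return it == f_options.end() ? std::string() : it->second;
    }

private:
    options_t f_options;
};
```

With this sketch, calling setReplicationFactor(1) and then setDescriptionOptions() with a map that has no "replication_factor" entry leaves the keyspace without one, so the safe order is to set the full map first and the replication factor last.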

From the thrift Cassandra interface:

/* describes a keyspace. */
struct KsDef {
    1: required string name,
    2: required string strategy_class,
    3: optional map<string,string> strategy_options,

    /** @deprecated, ignored */
    4: optional i32 replication_factor,

    5: required list<CfDef> cf_defs,
    6: optional bool durable_writes=1,
}

Cassandra: Column Family (CfDef)

Many of the cache and memory parameters were deprecated and are ignored since Cassandra version 1.1.0. New fields were also introduced.

Large Test (1.2 million rows)

I wanted to create a test that exercises a very large number of rows, and especially the readRows() function. Unfortunately, this did not come out as I expected at first. The cluster defines a Partitioner and, if it is set to RandomPartitioner, the old version of the readRows() function (which makes use of the get_range_slices() API call) does not work right.

You can see the Partitioner using the describe cluster command in your Cassandra CLI.

[...] describe cluster;
Cluster Information:
   Snitch: org.apache.cassandra.locator.SimpleSnitch
   Partitioner: org.apache.cassandra.dht.RandomPartitioner
   Schema versions:
    c289d6bb-cd39-3aa3-9545-b6677eb4a2a9: [127.0.0.1]

Note that you cannot change the partitioner once you have created a cluster. By then it is too late, because the system only appends data to table files (i.e. write once, read many times.)

So if you want to switch but you already have data in your cluster, you won't be able to do it. Plus, the ByteOrderedPartitioner (BOP) is not a bad thing, but the entire cluster will be set that way and it may slow down some of your nodes much more than the RandomPartitioner (RP). So the RP is probably your best bet; it's just that it prevents you from reading slices of rows in the order you'd otherwise expect. If you need a specific order, use columns in another table as an index. And if you have a very large number of rows to index, you can break up your index into multiple rows since each row can have a different set of columns (interesting concept!)
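Here is a rough sketch of the index idea, using in-memory std::map objects in place of a real index table. All the names (index_table_t, addToIndexSketch(), readInOrderSketch(), BUCKET_SIZE, the "bucket-" row keys) are made up for this example, and it assumes column names are compared as plain bytes so that zero-padded numbers sort numerically. Each index row is fetched by its known key, so the partitioner does not matter, and the columns inside a row come back sorted.

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// column name (sort key) -> key of the row in the data table
typedef std::map<std::string, std::string> index_row_t;
// index row key -> sorted columns; stands in for the index table
typedef std::map<std::string, index_row_t> index_table_t;

// Hypothetical number of entries stored in each index row.
const unsigned long BUCKET_SIZE = 1000;

// Zero-pad so that byte-wise ordering matches numeric ordering.
std::string paddedSketch(unsigned long n)
{
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%010lu", n);
    return buf;
}

// Record data row number n in the index row (bucket) it belongs to.
void addToIndexSketch(index_table_t& index, unsigned long n, const std::string& data_row_key)
{
    assert(BUCKET_SIZE > 0);
    index["bucket-" + paddedSketch(n / BUCKET_SIZE)][paddedSketch(n)] = data_row_key;
}

// Visit the buckets by key, in order, and collect the data row keys.
std::vector<std::string> readInOrderSketch(const index_table_t& index, unsigned long bucket_count)
{
    std::vector<std::string> keys;
    for(unsigned long b(0); b < bucket_count; ++b)
    {
        index_table_t::const_iterator row(index.find("bucket-" + paddedSketch(b)));
        if(row == index.end())
        {
            continue; // nothing was indexed in this bucket
        }
        for(index_row_t::const_iterator col(row->second.begin()); col != row->second.end(); ++col)
        {
            keys.push_back(col->second);
        }
    }
    return keys;
}
```

Even if the entries are added in reverse or random order, readInOrderSketch() returns the data row keys in numeric order, which is the order a range read over the data rows cannot give you with the RP.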

So, the large test works, but although I write rows 0 to 1.2 million, I retrieve all of that data in what looks like a completely random order.
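To illustrate why the order looks random: with the RP, rows are placed and scanned by a token computed from a hash of the row key, not by the key itself. The sketch below imitates that with a small FNV-1a hash as a stand-in for the MD5 based token Cassandra really uses; tokenSketch() and scanOrderSketch() are hypothetical names and none of this is Cassandra or libQtCassandra code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// FNV-1a 64 bit hash; only a deterministic stand-in for the real MD5 token.
std::uint64_t tokenSketch(const std::string& key)
{
    std::uint64_t h(14695981039346656037ULL);
    for(std::string::size_type i(0); i < key.length(); ++i)
    {
        h ^= static_cast<unsigned char>(key[i]);
        h *= 1099511628211ULL;
    }
    return h;
}

// Return the keys in the order a token-ordered scan would visit them.
std::vector<std::string> scanOrderSketch(const std::vector<std::string>& keys)
{
    std::vector<std::pair<std::uint64_t, std::string> > tokens;
    for(std::size_t i(0); i < keys.size(); ++i)
    {
        tokens.push_back(std::make_pair(tokenSketch(keys[i]), keys[i]));
    }
    // Sort by token first (the key only breaks ties between equal tokens).
    std::sort(tokens.begin(), tokens.end());

    std::vector<std::string> result;
    for(std::size_t i(0); i < tokens.size(); ++i)
    {
        result.push_back(tokens[i].second);
    }
    assert(result.size() == keys.size());
    return result;
}
```

Every key comes back exactly once, but in an order that depends on the hash rather than on the order in which the rows were written.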

IMPORTANT UPDATE

The large test does NOT work right. This is not because of the test itself, but because Cassandra cannot actually handle such a large set of write and read requests in a row, at least not with a very small cluster (i.e. one computer in most of my testing.) There are comments and reports about such problems, especially from people who write code to copy data from one cluster to another and expect to be able to run it at full speed. That doesn't work right! There are several reasons why, but more or less, you will lose data if you attempt to write millions of new rows too fast.
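One possible way to avoid writing too fast is to pace the writes in batches. This is only a sketch of that idea, and not necessarily what the large test ends up doing: pacedWriteSketch() is a hypothetical helper, and the two callbacks stand in for the real row write and for whatever pause you choose.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Call write_row(i) for each i in [0, row_count) and call pause() after
// every full batch, except after the very last row.
// Returns the number of pauses taken.
std::size_t pacedWriteSketch(
        std::size_t row_count,
        std::size_t batch_size,
        const std::function<void(std::size_t)>& write_row,
        const std::function<void()>& pause)
{
    assert(batch_size > 0);
    std::size_t pauses(0);
    for(std::size_t i(0); i < row_count; ++i)
    {
        write_row(i);
        if((i + 1) % batch_size == 0 && i + 1 < row_count)
        {
            pause(); // give the cluster time to catch up
            ++pauses;
        }
    }
    return pauses;
}
```

In a real test, write_row would insert one row in the cluster and pause would sleep for a short time (for example with std::this_thread::sleep_for). The batch size and the sleep time depend on the cluster and have to be found by experiment.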

Environment Management

As I did some clean up on my system, I found out that the thrift library, which is part of the 3rdParty sub-folder, was not found at link time. A command was missing in the CMakeLists.txt. I added it, and you should not have to do anything special to link with thrift.

Documentation

Many of the functions were tweaked to support 1.1 and as a result their documentation changed. If you start having a problem with one of your functions, you may want to check the latest documentation to make sure you know whether you're good to go with it or not.

Alexis Wilke's blog

Comments

Re: Looking forward to it.

Thu 08/23/12 by Admin

That would be great! I now have 0.4.2 available on SourceForge.net. Let me know if you find problems. Don't hesitate to post a patch on SourceForge.net so I can make updates for the next version.

Thank you,
Alexis

Looking forward to it.

Wed 08/22/12 by Anonymous (not verified)

I can help test/debug in Ubuntu/Solaris and OS X
