Adding support for composite column names (libQtCassandra 0.4.3)

Sat
09/01/12

I added some support for composite columns to libQtCassandra 0.4.3.

This allows one to read and write columns with composite names. A composite name gives you optimized storage and comparisons (e.g. if you include a 32 bit number, it uses 4 bytes in the table, whereas a name with a 32 bit number converted to hexadecimal would use 8 bytes, not counting possible separators and the risk that the separators break a simple, consistent sort order.)

Contrary to popular belief, the different parts of a composite column name are not separated by colons. That is only the case in the CLI and in many 4th generation languages (e.g. PHP, Ruby, etc.) The lower level implementation expects an array of names defined as follows:

<size><name><terminator>

Where the <size> parameter is an unsigned short (2 bytes) number written in big endian (i.e. a size of 0x0005 is written as 0x00 first and then 0x05 in the stream of bytes.)
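As a minimal sketch of that size field, here is one way to write it in plain C++. The helper name append_size is hypothetical and not part of libQtCassandra; it only illustrates the byte order described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a libQtCassandra function): append a length
// as an unsigned 16 bit number in big endian.
void append_size(std::vector<std::uint8_t>& out, std::size_t size)
{
    assert(size <= 0xFFFF);  // the size must fit in 2 bytes
    out.push_back(static_cast<std::uint8_t>(size >> 8));    // high byte first
    out.push_back(static_cast<std::uint8_t>(size & 0xFF));  // then the low byte
}
```

Calling append_size(out, 5) on an empty vector leaves the two bytes 0x00 and 0x05 in it, in that order.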

The <name> is whatever the column expects at that position. You must make sure it matches the type of the column. If UTF-8, then you must have a valid UTF-8 string (e.g. no 0xFF characters written as is!) If Integer, then you must pass the integer as a big endian number using the number of bytes that corresponds to that Integer type (e.g. a Long is expected to be exactly 8 bytes.)
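To illustrate the integer case, the following sketch converts an unsigned value to big endian on a fixed number of bytes. The function name to_big_endian is an assumption made for this example, and it only deals with unsigned values; a negative number would first have to be cast to its two's complement representation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: return `value` as a big endian number
// written on exactly `width` bytes (1 to 8).
std::vector<std::uint8_t> to_big_endian(std::uint64_t value, std::size_t width)
{
    assert(width >= 1 && width <= 8);
    std::vector<std::uint8_t> bytes(width);
    for(std::size_t i = 0; i < width; ++i)
    {
        // fill from the last byte (least significant) back to the first
        bytes[width - 1 - i] = static_cast<std::uint8_t>(value & 0xFF);
        value >>= 8;
    }
    return bytes;
}
```

With a width of 4, the value 12994 becomes 0x00 0x00 0x32 0xC2. With a width of 8 (a Long), the same value is preceded by four more 0x00 bytes.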

The <terminator> may be -1, 0, or 1 depending on the need and on where or when the value is passed down to Cassandra. When just reading or writing a cell in a row, you want to use zero. If you are querying, then -1 and 1 can be used too. A -1 means this value is excluded. A 1 means the end of the column names. Obviously, zero and 1 both mean this value is included. The fact that you can end the list means you can query rows with a limited number of composite column names instead of all of them. (To insert or read a specific cell you must always have all the names, so zero is always used; for a query, however, the idea is to use 0 when you want an exact match including that value, and -1 to exclude that value.)
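The three terminators can be given names, as in the sketch below. The enumeration and the function are hypothetical (they are not taken from libQtCassandra); the one point worth showing is that the terminator is a single signed byte, so -1 ends up as 0xFF in the stream.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for the three terminator values described above.
enum class Terminator : std::int8_t
{
    Excluded = -1,  // query only: this value excluded
    Included = 0,   // insert, read, or query: this value included
    End      = 1    // query only: end of the list of column names
};

// Convert a terminator to the single byte written in the stream.
std::uint8_t terminator_byte(Terminator t)
{
    std::int8_t const value = static_cast<std::int8_t>(t);
    assert(value >= -1 && value <= 1);
    return static_cast<std::uint8_t>(value);  // -1 becomes 0xFF
}
```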

Let's see an example for inserting a new value in Cassandra:

Say you have a composite column definition as follows:

CompositeColumn(UTF8, Integer, ASCII)

Inserting a new value with composite key: "ŒIL:12994:SNAP"

The first character in the key, Œ (O and E glued together, which is called a digraph, ligature, or grapheme), must be translated to proper UTF-8. This means 0xC5 followed by 0x92. The resulting string is 4 bytes, therefore the size is 0x00 0x04. The final strings of bytes look like this:

"ŒIL" -- 0x00 0x04 0xC5 0x92 0x49 0x4C 0x00

"12994" -- 0x00 0x04 0x00 0x00 0x32 0xC2 0x00

"SNAP" -- 0x00 0x04 0x53 0x4E 0x41 0x50 0x00

So we send all those bytes to the Cassandra server via Thrift as the column key:

00:04:C5:92:49:4C:00:00:04:00:00:32:C2:00:00:04:53:4E:41:50:00

(Note: the fact that all 3 composite keys are 4 bytes is just due to my poor choice of column names.)
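The example can be reproduced with a few lines of standard C++. This is only a sketch of the byte layout: the names Bytes, append_name, example_key, and to_hex are made up for the illustration, and libQtCassandra itself works with QCassandraValue objects rather than these helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

typedef std::vector<std::uint8_t> Bytes;

// Append one <size><name><terminator> entry to the key (hypothetical helper).
void append_name(Bytes& key, Bytes const& name, std::int8_t terminator)
{
    assert(name.size() <= 0xFFFF);
    key.push_back(static_cast<std::uint8_t>(name.size() >> 8));    // size, high byte
    key.push_back(static_cast<std::uint8_t>(name.size() & 0xFF));  // size, low byte
    key.insert(key.end(), name.begin(), name.end());               // the name itself
    key.push_back(static_cast<std::uint8_t>(terminator));          // terminator
}

// Build the key for ("ŒIL", 12994, "SNAP"), all terminated with zero.
Bytes example_key()
{
    std::string const first("\xC5\x92" "IL");  // "ŒIL" already encoded in UTF-8
    std::uint32_t const second(12994);          // 0x000032C2
    std::string const third("SNAP");            // plain ASCII

    Bytes number;
    for(int shift = 24; shift >= 0; shift -= 8)  // 4 bytes, big endian
    {
        number.push_back(static_cast<std::uint8_t>((second >> shift) & 0xFF));
    }

    Bytes key;
    append_name(key, Bytes(first.begin(), first.end()), 0);
    append_name(key, number, 0);
    append_name(key, Bytes(third.begin(), third.end()), 0);
    return key;
}

// Format the bytes as colon separated hexadecimal for display.
std::string to_hex(Bytes const& bytes)
{
    std::string result;
    char buf[4];
    for(std::size_t i = 0; i < bytes.size(); ++i)
    {
        std::snprintf(buf, sizeof(buf), i == 0 ? "%02X" : ":%02X",
                      static_cast<unsigned>(bytes[i]));
        result += buf;
    }
    return result;
}
```

Here to_hex(example_key()) returns the same 21 bytes as the column key shown above (with hexadecimal letters in uppercase).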

As we can see, the integer is sent as an integer (in big endian). This means we do NOT have to convert the integer to ASCII and then back to an integer before sending it to the Cassandra server. Instead, we use a set of QCassandraValue objects and no conversion takes place. [1]

Note that if some names are missing, the system can send a NULL name. To do so, the system uses three zeros (size of zero, no data, and the null terminator):

0x00 0x00 0x00

A NULL name is perfectly valid, but it may add complexities when querying your columns.

Read a specific value back

This uses exactly the same encoding as the insert. The column names are all terminated with 0x00 and the key must include all the names, as defined in the insert.

Querying a range of columns [2]

Querying is very similar to inserting and direct reading, except that the number of column names specified can be anywhere from 1 to n, n being the total number defined when creating the table. For this reason, you must terminate the list of column names with a one. With the same example, the key would look like one of these:

00:04:C5:92:49:4C:01
00:04:C5:92:49:4C:00:00:04:00:00:32:C2:01
00:04:C5:92:49:4C:00:00:04:00:00:32:C2:00:00:04:53:4E:41:50:01

The <terminator> can also be set to -1 to retrieve columns with a name larger than the specified value (instead of larger than or equal to it.)

To start querying from the very beginning, use a completely empty key ("").
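The query keys above can be sketched the same way as the insert key. The function query_key below is hypothetical and untested against a Cassandra server; it only follows the byte layout described in this post, and it assumes the names are already encoded (UTF-8, big endian, etc.)

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint8_t> Bytes;

// Hypothetical helper: build a query key from the first `count` names.
// Every name but the last is terminated with 0; the last one receives
// `last_terminator` (1 as described above, or -1 to exclude the value).
Bytes query_key(std::vector<Bytes> const& names, std::size_t count,
                std::int8_t last_terminator = 1)
{
    assert(count <= names.size());
    Bytes key;  // stays empty when count is 0: query from the very beginning
    for(std::size_t i = 0; i < count; ++i)
    {
        Bytes const& name = names[i];
        assert(name.size() <= 0xFFFF);
        key.push_back(static_cast<std::uint8_t>(name.size() >> 8));
        key.push_back(static_cast<std::uint8_t>(name.size() & 0xFF));
        key.insert(key.end(), name.begin(), name.end());
        bool const last = i + 1 == count;
        key.push_back(static_cast<std::uint8_t>(last ? last_terminator : 0));
    }
    return key;
}
```

With the three names of the example, a count of 1, 2, or 3 gives the three keys listed above, and a count of 0 gives the empty key.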

1. I know that converting integers to and from ASCII is not that bad. It is rather fast. Now just think about doing so 1 million times a minute and you can imagine how much computation time you save! But to take it further, the computer would have to determine the column type in order to convert anything... that would also be somewhat slow, although it could be done once and cached. Next you'd also have to convert floating point numbers, and that is not only much slower than integers, it generally is not a good idea since you are likely to lose a few bits in the process!
2. At the time of writing, I had not yet implemented this part. The encoding is certainly the same, but the terminator of -1 versus 0 still needs to be tested to ensure the encoding works as documented.

Cassandra

Alexis Wilke's blog

Comments

Re: Adding support for composite column names ...

Wed 09/05/12 by Anonymous (not verified)

Hi,

After looking more carefully at the code, it really seems that it uses very basic functionality of Qt, so I guess you are right that I can use Qt 4.6 without a problem.

In that case, I will stick with the current implementation since it works perfectly in Debian squeeze with Qt 4.6.2.

Maybe it would be possible to modify the file qt.cmake in the distribution so the version requirement is lowered to 4.6? For me it is a minor issue since it takes seconds to edit the file, but maybe people go away when seeing that the version requirement for Qt is 4.7. Especially in the server world, which is where this library will be most used. For example, the latest versions of Debian (squeeze) and CentOS (6.3) ship with Qt 4.6.2.

Thanks!

Re: Adding support for composite column names ...

Tue 09/04/12 by Admin

Hi Nacho,

I would think that Qt 4.6.x will work just fine. We have other dependencies in the main project (Snap C++) which require 4.7.x and we use the same file for libQtCassandra.

Boost is cool, at least for the basic classes such as shared_ptr<> and boost::uint32_t. However, for advanced things, it breaks quite a lot between versions. For our Snap C++ project, we wanted a library that would last a while and not require us to redesign everything each time (boost is always this "test bed" code... and it can easily break between versions.)

This being said, for a Cassandra C++ library, it is probably well adapted.

In regard to using std::string, it's fine. I like the Qt library because you can convert between UTF-8 and UCS-2 without having to do anything. But if you only deal with UTF-8 data, it probably doesn't matter either way. Plus an std::string supports null characters which is how Cassandra handles binary buffers.

We also use Qt for network connections, HTTP and our own protocols. Again boost is not as well adapted for that.

Now, the one thing we use boost for (hey!) is signals2. That works much better and much faster than the Qt events, since Qt events are dynamically allocated and so waste a lot of time at runtime...

Would I be interested in such a change to libQtCassandra?

Well... obviously, it would clash with the name of the library, wouldn't it? Personally, I am not directly interested, but for Cassandra users, it certainly is a good idea to offer them to use that version too (give them the choice.)

Would you be able to take the time to maintain that Boost version? If so, you could create a libcassandra or libboostcassandra project on SourceForge (or some other such place) and offer the library there. I can then add a link or two on the libQtCassandra page to send people there.

If you don't think you'd have the time to maintain it, I could upload it in a folder alongside libQtCassandra on SF.net for others to enjoy, but it would not be maintained...

Re: Adding support for composite column names ...

Tue 09/04/12 by Anonymous (not verified)

Hi,

I've tried to install libQtCassandra 0.4.4 in Debian squeeze and the problem is that the stock version is Qt 4.6.3. Cmake does not like it, so I've edited the qt.cmake file to allow the use of Qt 4.6.3... and it seems to work fine. Do you know if it is safe to use Qt 4.6.x?

On the other hand, I've been hacking libQtCassandra to get rid of the Qt dependency and use boost instead. It seems to work fine, but I had to strip all the utf8 conversions and use std::string instead of QByteArray. The tests run fine, but I'm not sure if it is really ok. Would you be interested in such a change ? I think that the Qt 4.7.x dependency is a little bit too much for many people.

Thanks in advance

Nacho
