Snap! Websites
An Open Source CMS System in C++
Cassandra is very lightweight. Contrary to a standard database, it bases the safety of your data on the fact that it gets replicated many times, not on the fact that it gets transported safely between you, its journal, and the drive.
That lightweight design has a real cost, though. Once in a while the tables or, more specifically, a node's journal gets mangled. When that happens, you can continue to use Cassandra for any data that appears before the mangled data. This gives you the impression that everything works when, in fact, something is awry in that node.
An interesting side effect of this, if you ask me, is that Cassandra is not going to take a chance and reset your tables automatically. Instead, it does nothing at all to your data without your authorization. This sounds good until you get invalid data and the database decides to throw when it cannot read it.
When such problems happen, it is very likely that Cassandra will write something about the failure in its log. I suggest you look at it to see what it says before proceeding. If it sounds like something completely different, avoid running the commands here, which could destroy some important data.
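A minimal shell sketch of that log check follows. The default path /var/log/cassandra/system.log and the search patterns are assumptions (they match a typical package installation, not necessarily yours), and recent_cassandra_errors is only a name made up for this example.

```shell
#!/bin/sh
# Print the most recent error lines from a Cassandra log file.
# Usage: recent_cassandra_errors [logfile]
# ASSUMPTION: the default path is the usual package location; adjust it
# to wherever your installation writes its log.
recent_cassandra_errors() {
    logfile=${1:-/var/log/cassandra/system.log}
    # Keep lines logged at ERROR level or mentioning corruption,
    # then show only the last 20 of them.
    grep -i -E 'ERROR|corrupt' "$logfile" | tail -n 20
}

# Example:
# recent_cassandra_errors /var/log/cassandra/system.log
```

If the lines printed do not mention a read failure or a corrupted file, the problem is probably a different one and scrubbing may not be the right answer.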
In most cases, libQtCassandra receives a timeout error from Thrift. If Cassandra itself fails, it throws and thus loses its connection to Thrift. Unfortunately, Thrift does not detect the disconnection immediately. The result is a long pause until Thrift decides that it has waited too long and throws a Timeout error.
There are cases where the database does not throw. In particular, one of our users had a problem with their database while reading all the tables. Apparently, in that case the system does not time out, but instead returns invalid data. libQtCassandra will detect such invalid data, although it is not otherwise told that anything went wrong, which I think is a problem.
The effects can vary depending on the error you get from the Cassandra node. In this last case the user was getting:
what(): ColumnDef and QCassandraColumnDefinitionPrivate names don't match
That error is internal to libQtCassandra; it is raised when we detect a name mismatch between two different locations. It really looks like in some cases we are not even getting a valid value back, and we may need to fix libQtCassandra in that regard, but the problem is complicated to reproduce.
In my case, Snap! generates an error that looks like this:
-- Row: 04ae93dc70aea72bdb7e673b6c96cc55
terminate called after throwing an instance of 'org::apache::cassandra::TimedOutException'
what(): Default TException.
Aborted (core dumped)
To fix the problem, you may want to run the scrub command of the nodetool utility.
nodetool scrub <context> [<table>]
To scrub (repair) your context, use nodetool as shown. You can scrub just one table if you want; however, in my experience, running it on the entire node fixes all the tables with a problem at once. Remember that any data that is still available before the erroneous entry will appear to work, even if that data comes from a table which needs scrubbing.
Update: Today I ran into a little problem similar to this one and I had to specifically scrub the files table. Just scrubbing the context was not enough to fix the bug. So I ran:
nodetool scrub snap_websites files
Note that if you have more than one node, you probably want to run the repair command on the node that you just scrubbed, to make sure you are still properly synchronized.
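The scrub followed by a repair can be wrapped in a small shell function. This is only a sketch: scrub_then_repair and the NODETOOL override are hypothetical names introduced for this example, and it assumes nodetool is in your PATH and that you want to repair the same context (and table, if any) that you just scrubbed.

```shell
#!/bin/sh
# Scrub a context (optionally a single table), then repair it.
# Usage: scrub_then_repair <context> [<table>]
# NODETOOL may be overridden, e.g. NODETOOL="echo nodetool" for a dry run.
scrub_then_repair() {
    context=${1:-}
    table=${2:-}
    if [ -z "$context" ]; then
        echo "usage: scrub_then_repair <context> [<table>]" >&2
        return 1
    fi
    # ${table:+...} expands to nothing when no table was given.
    # Stop here if the scrub fails: repairing would not make sense then.
    ${NODETOOL:-nodetool} scrub "$context" ${table:+"$table"} || return 1
    ${NODETOOL:-nodetool} repair "$context" ${table:+"$table"}
}

# Example:
# scrub_then_repair snap_websites files
```

Setting NODETOOL="echo nodetool" prints the two commands instead of running them, which is a cheap way to double check what would be executed before touching real data.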
Note: Cassandra calls its on-disk table files, including the ones that cause a problem, SSTables. That is the name you will see in its log.
Note: In Snap! we have a test named tests/test_dump_all.sh which one can use to see what fails. It may take a while, and you want to keep most of the output to be able to figure out which table is affected.
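One possible way to keep that output around is to send everything to a file and look at the end of it once the test stops. The sketch below assumes you run it from the root of the Snap! source tree; capture_run and the log file name are hypothetical.

```shell
#!/bin/sh
# Run a command, keep all of its output (stdout and stderr) in a log file,
# then print the last lines, which is where the failure usually shows up.
# Usage: capture_run <logfile> <command> [args...]
capture_run() {
    logfile=$1
    shift
    # The exit status of the command is deliberately ignored:
    # a crash is exactly what we are looking for.
    "$@" > "$logfile" 2>&1 || true
    tail -n 20 "$logfile"
}

# Example:
# capture_run /tmp/test_dump_all.log tests/test_dump_all.sh
```

The full log stays in the file, so you can scroll back from the last row printed to find which table was being read when the error occurred.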