Snap! Websites
An Open Source CMS System in C++
As you use Cassandra, once in a while some of the tables it handles will get corrupted. In most cases, I think that happens when I restart my computer without first stopping Cassandra. One problem with Cassandra is that it may take several seconds between the kill command and Cassandra actually exiting. If you shut down your computer, the shutdown process sends a kill (TERM) and then waits only something like 1 second before sending a -9 (i.e. an actual KILL signal instead of just TERM).
The result is that it may leave some files (tables) in an unknown state. These files can be deleted, but you have to be careful. On a production system, make sure you use a proper installation of Cassandra so that it gets shut down cleanly. Ubuntu users are in luck since there is a PPA offered by Cassandra to install it directly on your systems. That installation includes proper start and stop scripts.
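If you run Cassandra by hand without such scripts, the idea is simply to send TERM and wait until the process is really gone before you shut down. Here is a minimal sketch of that wait loop. Note that stop_and_wait is a hypothetical helper name and the 60 second default is an arbitrary assumption, not a value recommended by Cassandra.

```shell
#!/bin/sh
# Sketch only: send TERM to a process and wait until it has really exited.
# stop_and_wait is a hypothetical helper, not something shipped with Cassandra.
#   $1: process id to stop
#   $2: maximum number of seconds to wait (assumed default: 60)
stop_and_wait() {
    pid="$1"
    timeout="${2:-60}"

    # If the signal cannot be delivered, the process is most likely gone already.
    kill -TERM "$pid" 2>/dev/null || return 0

    waited=0
    # kill -0 sends no signal; it only checks that the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "process $pid still running after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

You would call it with the process id of Cassandra, for example stop_and_wait "$pid" 120, and only go on with the reboot once it returns 0. It never sends a -9 itself, so a non-zero return just tells you Cassandra needs more time.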
When an error occurs with invalid tables, you will most likely receive timeout errors. These errors occur because Cassandra throws an exception and thus never replies to Thrift, and at some point Thrift gives up and throws that timeout error. You can find the Cassandra errors in its logs. In most cases the messages do not make much sense (i.e. row key != row key... if you ask me...), but when they are linked to invalid tables, it is not unlikely for them to say that Cassandra cannot seek to the right location.
Example of seek() error:
ERROR [ReadStage:2] 2014-02-24 22:34:32,001 CassandraDaemon.java (line 185) Exception in thread Thread[ReadStage:2,5,main]
java.lang.RuntimeException: java.lang.IllegalArgumentException: unable to seek to position 1044584 in /home/alexis/cassandra2/data/snap_websites/files/snap_websites-files-jb-1-Data.db (753318 bytes) in read-only mode
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1901)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.IllegalArgumentException: unable to seek to position 1044584 in /home/alexis/cassandra2/data/snap_websites/files/snap_websites-files-jb-1-Data.db (753318 bytes) in read-only mode
    at org.apache.cassandra.io.util.RandomAccessReader.seek(RandomAccessReader.java:274)
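To find out which files are affected without reading the whole log by hand, you can pull the file names out of these messages. The following is a rough sketch that only looks for the "unable to seek to position" text shown in the example above. Note that list_unseekable_sstables is a hypothetical helper name, and the location of the log file depends on your installation, so it is passed as a parameter.

```shell
#!/bin/sh
# Sketch only: print each Data.db file named in an "unable to seek" error, once.
# list_unseekable_sstables is a hypothetical helper name.
#   $1: path to the Cassandra log file (location depends on your installation)
list_unseekable_sstables() {
    grep 'unable to seek to position' "$1" |
        sed -n 's/.*unable to seek to position [0-9]* in \([^ ]*-Data\.db\).*/\1/p' |
        sort -u
}
```

The directory just above the reported file is the table and the one above that is the keyspace (files and snap_websites in the example), which tells you what to pass to nodetool.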
To repair your tables, you want to run the following two commands:
nodetool scrub snap_websites
nodetool repair snap_websites
You may specify the exact table that generated an error if you know which one it is. It is definitely easier to just scrub all the SSTables of the keyspace in one go (scrub rewrites them, dropping the rows it cannot read), and then send a repair command to the node so that it can repair itself by copying data from other nodes.
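If you do this more than once, the two commands can be wrapped so that the repair only runs when the scrub succeeded, with the table name as an optional second argument. This is only a sketch: scrub_and_repair is a hypothetical name, and the NODETOOL variable is something I added so that you can point to another binary or do a dry run. It is not something nodetool itself knows about.

```shell
#!/bin/sh
# Sketch only: scrub a keyspace (or a single table), then repair it.
# scrub_and_repair and the NODETOOL override are hypothetical conveniences.
#   $1: keyspace, e.g. snap_websites
#   $2: optional table name, e.g. files
scrub_and_repair() {
    keyspace="$1"
    table="${2:-}"
    nodetool_cmd="${NODETOOL:-nodetool}"

    # ${table:+"$table"} expands to nothing at all when no table was given.
    "$nodetool_cmd" scrub "$keyspace" ${table:+"$table"} || return 1
    "$nodetool_cmd" repair "$keyspace" ${table:+"$table"}
}
```

For the error shown earlier that would be scrub_and_repair snap_websites files. Running it with NODETOOL=echo just prints the two command lines instead of executing them.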
Obviously this won't really help you if many of your nodes have problems at the same time, since the repair relies on the other nodes having good data. However, I would imagine that to be rather unlikely. (Errors do not happen so often that I could imagine all the nodes failing all at once.)