Snap! Websites
An Open Source CMS System in C++
In order to run tests against something that looks like a real cluster of Cassandra nodes, you want to create multiple nodes, run them in parallel, and run your software against that cluster.
So, I decided to create two racks on two separate computers. Each computer runs 3 VMs, each running one instance of Cassandra. The VMs are installed with the most basic Ubuntu server (i.e. do not add anything when offered to install LAMP and other systems), plus Java and Cassandra.
It is possible to set up a VM to make use of the local network (LAN). VirtualBox automatically makes the necessary additions to your main system so the VM is accessible from anywhere on your local network. What you have to do manually, in most cases, is change your /etc/network/interfaces file to use a static IP address on your local network. You may also use DHCP, but then you won't know what the IP address of that VM is going to be unless you dynamically add the address to a name server, which I do not recommend (especially if you also want to have a strong firewall).
Also, as with the Host-only method (see below), you want to make sure that each VM was given a different MAC address.
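If you duplicated a VM, you can regenerate the MAC address from the VM's Network settings, or from the command line with VBoxManage (a sketch; "vm202" is a placeholder for your actual VM name, and the VM must be powered off):

VBoxManage modifyvm "vm202" --macaddress1 auto     # regenerate the MAC of the first card
VBoxManage showvminfo "vm202" | grep -i mac        # verify the addresses are all different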
Next, change the information in your interfaces file so the settings are picked up when you restart your VMs. First, edit the file:
sudo vim /etc/network/interfaces
Then change the eth0 entry to:
auto eth0
iface eth0 inet static
    address 192.168.0.20
    netmask 255.255.255.0
    network 192.168.0.0
    broadcast 192.168.0.255
    gateway 192.168.0.1
    dns-nameservers 8.8.8.8 ...
    dns-search example.com
The IP address selected here is 192.168.0.20. Make sure it matches your network. If you are using the 10.0.0.0 network, make sure to change the netmask, network, and broadcast accordingly.
Also, the gateway is important if you want the VM to be able to reach computers beyond your local subnet and send packets out to the Internet.
The dns-nameservers entry generally specifies two IP addresses of name servers, such as the Google name servers or your ISP's name servers. This is important only if you want the VM to be capable of resolving Internet domain names.
The dns-search entry is generally your domain name.
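For example (a sketch assuming a 10.0.0.0/24 network with the router at 10.0.0.1; keep your own nameservers and domain), the same entry would become:

auto eth0
iface eth0 inet static
    address 10.0.0.20
    netmask 255.255.255.0
    network 10.0.0.0
    broadcast 10.0.0.255
    gateway 10.0.0.1
    dns-nameservers 8.8.8.8
    dns-search example.com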
Now restart and check that the VM is visible from any of your LAN computers by using the ping tool:
ping 192.168.0.20
You can avoid the restart by tearing down the network interface before you make the edit proposed here, and then bringing it back up as follows. However, if you are doing that work over SSH, it probably won't work as expected...
sudo ifdown eth0
sudo vim /etc/network/interfaces
sudo ifup eth0
That said, a reboot of a VM is generally really fast, so it is probably a good idea anyway, to make sure that future reboots will work as expected.
Once I had one node, I just duplicated it 3 times, made sure that each copy had a different MAC address, and added a second network card. To set up the second card properly, which means setting it up as a "Host-only Adapter", I had to add vboxnet0. This is done by going into the VirtualBox preferences:
Once vboxnet0 is set up, go to your VM settings, click on Network, and add a new card (eth1) for which you select "Host-only Adapter" and vboxnet0. I also set the Promiscuous Mode to "Allow All". I'm not too sure what "Cable Connected" means exactly; I selected it...
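If you prefer the command line, the same setup can likely be done with VBoxManage (a sketch; "vm202" is a placeholder for your VM name, and 192.168.56.1 is the usual default host address for vboxnet0):

VBoxManage hostonlyif create                     # creates vboxnet0 if it does not exist yet
VBoxManage hostonlyif ipconfig vboxnet0 --ip 192.168.56.1 --netmask 255.255.255.0
VBoxManage modifyvm "vm202" --nic2 hostonly --hostonlyadapter2 vboxnet0 --nicpromisc2 allow-all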
Now you can restart your VMs. The eth1 network card will not show up automatically if you added it after installing Ubuntu, so we add it manually by editing the interfaces file:
sudo vim /etc/network/interfaces
Then append the following to the file:
auto eth1
iface eth1 inet static
    address 192.168.56.201
    netmask 255.255.255.0
    network 192.168.56.0
    broadcast 192.168.56.255
Do not set the gateway parameter; the gateway is on eth0 and it works as expected there. Change the IP addresses to fit your own network settings. The 192.168.56.x range is the default offered by VirtualBox.
Now you can start eth1 with:
sudo ifup eth1
Repeat on each node, with a different IP on each node, of course. Once all the nodes have their eth1 interface up, you can test using ping to make sure they can communicate:
ping 192.168.56.202
The ping should work pretty much as is.
Note that if you know how to set up a bridge, that would be better, since a bridged adapter gives you a setup that works across your entire physical local network without proxies.
While you are at it, you probably want to set up a different hostname for each VM too:
sudo hostname vm202
That way it will be easier to find them. The IP addresses can also be added to your /etc/hosts file; then you can install the SSH server and simply do 'ssh vm202'.
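For example, on each VM (and optionally on your host), /etc/hosts could contain entries like these (a sketch assuming three nodes at .201, .202, and .203 named vm201, vm202, and vm203):

192.168.56.201   vm201
192.168.56.202   vm202
192.168.56.203   vm203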
Now, to set up Cassandra, you want to enter a few parameters in this file:
sudo vim /etc/cassandra/cassandra.yaml
Search for the following parameters, starting with the cluster name:
The name of the cluster must be the same on all the nodes. By default you get a name such as "Test Cluster". You may want to change that, even if you are just running tests... (i.e. give it a more precise test name.)
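For example, in cassandra.yaml (the exact name is up to you; "Snap Test Cluster" here is just a placeholder):

cluster_name: 'Snap Test Cluster'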
Warning: if you already started the Cassandra cluster, the old name is already saved in the database, and changing the cluster_name: ... parameter will thus make Cassandra fail to start. Since the apt-get install cassandra command automatically starts Cassandra for you, you are kind of stuck with the default "Test Cluster" name... To remedy that, change the cluster_name variable, then delete the following two directories (BE VERY CAREFUL WITH 'sudo rm -rf ...'):
WARNING: THE FOLLOWING DELETES ALL THE DATA IN YOUR DATABASE!!! RUN AT YOUR OWN RISK -- since you are doing the setup, though, it shouldn't matter that you delete everything, right?
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/commitlog
sudo rm -rf /var/lib/cassandra/data
sudo vim /etc/cassandra/cassandra.yaml    # change cluster_name: ...
sudo service cassandra start
Now you have changed the cluster name. Do NOT do that if you already added important data to your cluster. If you already have data, it is still possible to change the name with the following commands. Note that it has to be done one node at a time:
$ cqlsh
cqlsh> UPDATE system.local SET cluster_name = 'new name' WHERE key = 'local';
cqlsh> QUIT
$ nodetool flush
The flush is a good idea to make sure the data gets fully updated. After that, stop the cluster, change the name in the cassandra.yaml file, and restart the cluster.
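In other words, on each node, one at a time (the same sequence as above, but without deleting any data):

sudo service cassandra stop
sudo vim /etc/cassandra/cassandra.yaml    # change cluster_name: ...
sudo service cassandra start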
Next, set the listen_address parameter to this node's IP address as defined on eth1. Each node will have a different address, of course.
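For example, on the first node (a sketch assuming the eth1 address 192.168.56.201 used above; you probably also want to set rpc_address to the same IP so clients and tools can connect over that interface):

listen_address: 192.168.56.201
rpc_address: 192.168.56.201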
Then search for seed_provider. Under that parameter, you will see a seeds: "..." entry. By default it is set to 127.0.0.1, which means only the local node is used (it does not try to connect to nodes on other computers).
The list does not need to be complete. It should include at least one other node. It may include the current node too.
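For example, with the addresses used above (a sketch; listing one or two of your nodes and using the same list on every node is simplest):

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.56.201,192.168.56.202"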
Next, you want Cassandra to gossip (a lot, once per second!) between the various nodes. Gossiping shares information about the nodes in the cluster with the other nodes in the cluster.
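Concretely, this amounts to setting the endpoint_snitch parameter; the snitch that reads the rack and data center file described below is GossipingPropertyFileSnitch (check your own cassandra.yaml, it may already be set):

endpoint_snitch: GossipingPropertyFileSnitch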
Along with this parameter, you want to name the data center and the rack. This is done in another file:
sudo vim /etc/cassandra/cassandra-rackdc.properties
The following is an example, naming the data center "DC1" and the rack "RAC1":
dc=DC1
rack=RAC1
prefer_local=true
I also suggest you turn on prefer_local (which is not defined by default).
On your second computer, make sure to use a different dc and/or rack name if it is part of a different data center (dc) or a different rack. Obviously, you may want to use better names if you have specific rooms or data centers (e.g. a data center in San Francisco could be called SF and one in New York, NY).
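For example, on the second computer (a sketch assuming it hosts the second rack in the same data center):

dc=DC1
rack=RAC2
prefer_local=true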
IMPORTANT NOTE: With this setup, the cassandra-topology.properties and cassandra-topology.yaml files are ignored; newer versions of Cassandra instead rely on the snitch defined in the cassandra.yaml file and the cassandra-rackdc.properties file shown above.
By default the snap_websites keyspace is created with a replication factor of 1. This is because in most cases developers are going to test on their system with a single node.
Once you are ready to add more nodes (remember that we recommend a minimum of 3, although for data safety, 4 should be the minimum), you should increase the replication factor. A good value is around N / 2, with at least 2, and possibly a little more than half, until you get more than 10 nodes. At that point you may want to stop increasing the replication factor each time a node is added. You may still want more than 5 replicas, since that helps when a large number of users hit the servers all at once.
To make that change with CQL, you can use the following command (after starting the command line tool, cqlsh):
ALTER KEYSPACE snap_websites WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 };
Make sure to change the count from 2 to whatever you want to use. At some point we will offer the root user a way to change those settings from the browser, although there is a catch: you also need to run:
nodetool repair
on each of the affected Cassandra nodes. It is not exactly trivial.
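If only the snap_websites keyspace changed, the repair can likely be limited to that keyspace (a sketch; check nodetool help repair for the options available in your version):

nodetool repair snap_websites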
So, once you have all your VMs ready, can ping them, and have edited the Cassandra settings on each node, you are ready to test. Make sure all the Cassandra servers are running:
ps -ef | grep cassandra
And then run the nodetool command to see that your ring is working:
nodetool status
If it worked, you see one line per Cassandra node. Each line gives you the data center and rack name so you can easily see which nodes are running and which are not.
When making a clone of a virtual machine drive, the clone has a copy of the original Cassandra data, so it is not compatible with the cluster as a new node. You must delete that data before you can make the clone a new node. The process would be:
sudo service cassandra stop
sudo vim /etc/hostname                     # change the host name; each computer should have a different name
sudo vim /etc/hosts                        # change the name associated with 127.0.0.1 to the new host name
sudo vim /etc/network/interfaces           # change the IP address of eth1, staying in the same network
sudo vim /etc/cassandra/cassandra-rackdc.properties
                                           # change the name of the data center or the rack as required
sudo vim /etc/cassandra/cassandra.yaml     # change the IP in listen_address and rpc_address,
                                           # also change the list of seeds under seed_provider
sudo rm -rf /var/lib/cassandra/commitlog   # delete the cloned data so the node starts empty
sudo rm -rf /var/lib/cassandra/data        # (see the warning above: this destroys the copied database)
sudo service cassandra start
sudo init 6
After the restart, 'nodetool status' should show the new node in the list.