Snap! Websites
An Open Source CMS System in C++
In order to run tests against something that looks like a real cluster of Cassandra nodes, you want to create multiple nodes, run them in parallel, and run your software against that cluster.
So, I decided to create two racks on two separate computers. Each computer runs 3 VMs, each running one instance of Cassandra. The VMs are installed with the most basic Ubuntu server (i.e. do not add anything when offered to install LAMP and other systems), plus Java and Cassandra.
It is possible to set up a VM to make use of the local network (LAN). VirtualBox automatically makes the necessary additions to your main system so the VM is accessible from anywhere on your local network. What you have to do manually, in most cases, is change your /etc/network/interfaces file to use a static IP address on your local network. You may also use DHCP, but then you won't know what the IP address of that VM is going to be unless you dynamically add the address to a name server, which I do not recommend (especially if you also want to have a strong firewall).
Also, as with the Host-only method (see below), you want to make sure that each VM was given a different MAC address.
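If you duplicated a VM, you can regenerate the MAC address from the VM's Network settings, or from the command line with VBoxManage (a sketch; "vm202" is a placeholder for your actual VM name, and the VM must be powered off):

VBoxManage modifyvm "vm202" --macaddress1 auto     # regenerate the MAC of the first card
VBoxManage showvminfo "vm202" | grep -i mac        # verify the addresses are all different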
Next, change the information in your interfaces file so the settings are picked up when you restart your VMs. First, edit the file:
sudo vim /etc/network/interfaces
Then change the eth0 entry to:
auto eth0
iface eth0 inet static
    address 192.168.0.20
    netmask 255.255.255.0
    network 192.168.0.0
    broadcast 192.168.0.255
    gateway 192.168.0.1
    dns-nameservers 8.8.8.8 ...
    dns-search example.com
The IP address selected here is 192.168.0.20. Make sure it matches your network. If you are using the 10.0.0.0 network, make sure to change the netmask, network, and broadcast accordingly.
Also, the gateway is important if you want the VM to be able to reach computers beyond your local subnet and send packets out to the Internet.
The dns-nameservers entry generally specifies two IP addresses of name servers, such as the Google name servers or your ISP's name servers. This is important only if you want the VM to be capable of resolving Internet domain names.
The dns-search entry is generally your domain name.
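For example (a sketch assuming a 10.0.0.0/24 network with the router at 10.0.0.1; keep your own nameservers and domain), the same entry would become:

auto eth0
iface eth0 inet static
    address 10.0.0.20
    netmask 255.255.255.0
    network 10.0.0.0
    broadcast 10.0.0.255
    gateway 10.0.0.1
    dns-nameservers 8.8.8.8
    dns-search example.com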
Now restart and check that the VM is visible from any of your LAN computers by using the ping tool:
ping 192.168.0.20
You can avoid the restart by tearing down the network interface before you make the edit proposed here, and then bringing it back up as follows. However, if you are doing that work over SSH, it probably won't work as expected...
sudo ifdown eth0
sudo vim /etc/network/interfaces
sudo ifup eth0
That said, a reboot of a VM is generally really fast, so it is probably a good idea anyway, to make sure that future reboots will work as expected.
Once I had one node, I just duplicated it 3 times, made sure that each copy had a different MAC address, and added a second network card. To set up the second card properly, which means setting it up as a "Host-only Adapter", I had to add vboxnet0. This is done by going into the VirtualBox preferences:
Once vboxnet0 is set up, go to your VM settings, click on Network, and add a new card (eth1) for which you select "Host-only Adapter" and vboxnet0. I also set the Promiscuous Mode to "Allow All". I'm not too sure what "Cable Connected" means exactly; I selected it...
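If you prefer the command line, the same setup can likely be done with VBoxManage (a sketch; "vm202" is a placeholder for your VM name, and 192.168.56.1 is the usual default host address for vboxnet0):

VBoxManage hostonlyif create                     # creates vboxnet0 if it does not exist yet
VBoxManage hostonlyif ipconfig vboxnet0 --ip 192.168.56.1 --netmask 255.255.255.0
VBoxManage modifyvm "vm202" --nic2 hostonly --hostonlyadapter2 vboxnet0 --nicpromisc2 allow-all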
Now you can restart your VMs. The eth1 network card will not show up automatically if you added it after installing Ubuntu, so we add it manually by editing the interfaces file:
sudo vim /etc/network/interfaces
Then append the following to the file:
auto eth1
iface eth1 inet static
    address 192.168.56.201
    netmask 255.255.255.0
    network 192.168.56.0
    broadcast 192.168.56.255
Do not set the gateway parameter; the gateway is on eth0 and it works as expected there. Change the IP addresses to fit your own network settings. The 192.168.56.x range is the default offered by VirtualBox.
Now you can start eth1 with:
sudo ifup eth1
Repeat on each node, with a different IP on each node, of course. Once all the nodes have their eth1 interface up, you can test using ping to make sure they can communicate:
ping 192.168.56.202
The ping should work pretty much as is.
Note that if you know how to set up a bridge, that would be better, since a bridged adapter gives you a setup that works across your entire physical local network without proxies.
While you are at it, you probably want to set up a different hostname for each VM too:
sudo hostname vm202
That way it will be easier to find them. The IP addresses can also be added to your /etc/hosts file; then you can install the SSH server and simply do 'ssh vm202'.
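For example, on each VM (and optionally on your host), /etc/hosts could contain entries like these (a sketch assuming three nodes at .201, .202, and .203 named vm201, vm202, and vm203):

192.168.56.201   vm201
192.168.56.202   vm202
192.168.56.203   vm203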
Now, to set up Cassandra, you want to enter a few parameters in this file:
sudo vim /etc/cassandra/cassandra.yaml
Search for the following parameters, starting with the cluster name:
The name of the cluster must be the same on all the nodes. By default you get a name such as "Test Cluster". You may want to change that, even if you are just running tests... (i.e. give it a more precise test name.)
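For example, in cassandra.yaml (the exact name is up to you; "Snap Test Cluster" here is just a placeholder):

cluster_name: 'Snap Test Cluster'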
Warning: if you already started the Cassandra cluster, the old name is already saved in the database, and changing the cluster_name: ... parameter will thus make Cassandra fail to start. Since the apt-get install cassandra command automatically starts Cassandra for you, you are kind of stuck with the default "Test Cluster" name... To remedy that, change the cluster_name variable, then delete the following two directories (BE VERY CAREFUL WITH 'sudo rm -rf ...'):
WARNING: THE FOLLOWING DELETES ALL THE DATA IN YOUR DATABASE!!! RUN AT YOUR OWN RISK -- since you are doing the setup, though, it shouldn't matter that you delete everything, right?
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/commitlog
sudo rm -rf /var/lib/cassandra/data
sudo vim /etc/cassandra/cassandra.yaml    # change cluster_name: ...
sudo service cassandra start
Now you have changed the cluster name. Do NOT do that if you already added important data to your cluster. If you already have data, it is still possible to change the name with the following commands. Note that it has to be done one node at a time:
$ cqlsh
cqlsh> UPDATE system.local SET cluster_name = 'new name' WHERE key = 'local';
cqlsh> QUIT
$ nodetool flush
The flush is a good idea to make sure the data gets fully updated. After that, stop the cluster, change the name in the cassandra.yaml file, and restart the cluster.
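In other words, on each node, one at a time (the same sequence as above, but without deleting any data):

sudo service cassandra stop
sudo vim /etc/cassandra/cassandra.yaml    # change cluster_name: ...
sudo service cassandra start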
Next, set the listen_address parameter to this node's IP address as defined on eth1. Each node will have a different address, of course.
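For example, on the first node (a sketch assuming the eth1 address 192.168.56.201 used above; you probably also want to set rpc_address to the same IP so clients and tools can connect over that interface):

listen_address: 192.168.56.201
rpc_address: 192.168.56.201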
Then search for seed_provider. Under that parameter, you will see a seeds: "..." entry. By default it is set to 127.0.0.1, which means only the local node is used (it does not try to connect to nodes on other computers).
The list does not need to be complete. It should include at least one other node. It may include the current node too.
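For example, with the addresses used above (a sketch; listing one or two of your nodes and using the same list on every node is simplest):

seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.56.201,192.168.56.202"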
Next, you want Cassandra to gossip (a lot, once per second!) between the various nodes. Gossiping shares information about the nodes in the cluster with the other nodes in the cluster.
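Concretely, this amounts to setting the endpoint_snitch parameter; the snitch that reads the rack and data center file described below is GossipingPropertyFileSnitch (check your own cassandra.yaml, it may already be set):

endpoint_snitch: GossipingPropertyFileSnitch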
Along with this parameter, you want to name the data center and the rack. This is done in another file:
sudo vim /etc/cassandra/cassandra-rackdc.properties
The following is an example, naming the data center "DC1" and the rack "RAC1":
dc=DC1
rack=RAC1
prefer_local=true
I also suggest you turn on prefer_local (which is not defined by default).
On your second computer, make sure to use a different dc and/or rack name if it is part of a different data center (dc) or a different rack. Obviously, you may want to use better names if you have specific rooms or data centers (e.g. a data center in San Francisco could be called SF and one in New York, NY).
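For example, on the second computer (a sketch assuming it hosts the second rack in the same data center):

dc=DC1
rack=RAC2
prefer_local=true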
IMPORTANT NOTE: With this setup, the cassandra-topology.properties and cassandra-topology.yaml files are ignored; newer versions of Cassandra instead rely on the snitch defined in the cassandra.yaml file and the cassandra-rackdc.properties file shown above.
By default the snap_websites keyspace is created with a replication factor of 1. This is because in most cases developers are going to test on their system with a single node.
Once you are ready to add more nodes (remember that we recommend a minimum of 3, although for data safety, 4 should be the minimum), you should increase the replication factor. A good value is around N / 2, with at least 2, and possibly a little more than half, until you get more than 10 nodes. At that point you may want to stop increasing the replication factor each time a node is added. You may still want more than 5 replicas, since that helps when a large number of users hit the servers all at once.
To make that change with CQL, you can use the following command (after starting the command line tool, cqlsh):
ALTER KEYSPACE snap_websites WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2 };
Make sure to change the count from 2 to whatever you want to use. At some point we will offer the root user a way to change those settings from the browser, although there is a catch: you also need to run:
nodetool repair
on each of the affected Cassandra nodes. It is not exactly trivial.
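If only the snap_websites keyspace changed, the repair can likely be limited to that keyspace (a sketch; check nodetool help repair for the options available in your version):

nodetool repair snap_websites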
So, once you have all your VMs ready, can ping them, and have edited the Cassandra settings on each node, you are ready to test. Make sure all the Cassandra servers are running:
ps -ef | grep cassandra
And then run the nodetool command to see that your ring is working:
nodetool status
If it worked, you see one line per Cassandra node. Each line gives you the data center and rack name so you can easily see which nodes are running and which are not.
When making a clone of a virtual machine drive, the clone has a copy of the original Cassandra data, so it is not compatible with the cluster as a new node. You must delete that data before you can make the clone a new node. The process would be:
sudo service cassandra stop
sudo vim /etc/hostname                     # change the host name; each computer should have a different name
sudo vim /etc/hosts                        # change the name associated with 127.0.0.1 to the new host name
sudo vim /etc/network/interfaces           # change the IP address of eth1, staying in the same network
sudo vim /etc/cassandra/cassandra-rackdc.properties
                                           # change the name of the data center or the rack as required
sudo vim /etc/cassandra/cassandra.yaml     # change the IP in listen_address and rpc_address,
                                           # also change the list of seeds under seed_provider
sudo rm -rf /var/lib/cassandra/commitlog   # delete the cloned data so the node starts empty
sudo rm -rf /var/lib/cassandra/data        # (see the warning above: this destroys the copied database)
sudo service cassandra start
sudo init 6
After the restart, 'nodetool status' should show the new node in the list.