Watchdog for a snap cluster [core]

Description

A standalone tool runs in the background of all your computers in a Snap! cluster, automatically started by snapinit as a backend. This tool gathers data to report about the current and past health of each server. Contrary to a backend, this tool is expected to run on ALL servers so we can have a better idea of how they run.

At this time the data gathered will include:

  • Network connectivity
    • Connectivity of various expected services and computers
    • Connectivity of local servers not directly related to Snap! Websites (i.e. SSH, SMTP, etc.)
  • Memory usage (Done)
    • Total amount of memory available on server
    • Used memory
    • Free memory
    • Cached memory
    • Memory used for Buffers
    • Swap memory (total and still available)
  • CPU usage (Done)
    • 1 min., 5 min., and 15 min. load averages
    • Amount of time spent running user and kernel code
    • Amount of time spent idle
    • Number of pages that got swapped out (i.e. read-only text and data pages)
    • Number of pages of data that got swapped (i.e. read-write data saved to swap)
    • Time when the system was last booted
    • Total number of processes
    • Total number of running processes and of blocked processes
  • Disk usage (Done)
    • Total disk space available
    • Disk space left per partition
  • Whether processes are running; we offer the following by default (Done)
    • Apache running (for front end servers)
    • Snap Init running
    • Snap Server running
    • Snap Backup running
    • Cassandra running (for Cassandra nodes)
  • Status of the Linux OS
    • Packages that are out of date
    • Whether a reboot of this server is expected

The data gathered by each snapwatchdog is shared between servers using the Cassandra database. The network connectivity, however, is used to make sure that all the servers are always accessible and running (i.e. healthy). The watchdog tool can also run on other servers (as in servers that are not part of the Snap! cluster).

Command Line

The snapwatchdog is a daemon. It is expected to be started by the snapinit tool and to run permanently.

The command line options are listed when using --help (-h).

Use the --debug (-d) option to avoid having the daemon detach itself. This allows you to more easily debug in your console. However, if you want to debug the sub-processes, make sure to compile the snapwebsites library in debug mode and with the SNAP_NO_FORK option. That will allow you to run all the children in the same process instead of a separate one. Note, however, that the plugins can only be initialized once so you can run things only once with the no-fork feature turned on.

The main other useful command line option is --config (-c) to define the path and filename of a configuration file different from the default.

Setup (/etc/snapwebsites/snapwatchdog.conf)

Technically, the tool is built with a set of simple "plugins", each of which gathers part of the data presented in the list above. The memory, CPU and disk usage are simply gathered using standard Linux APIs (procps, /proc/mounts, statvfs). The processes plugin checks the list of running processes and uses a regular expression to see whether certain processes are indeed present in the task table.
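To give an idea of the kind of data the memory, CPU and disk plugins collect, here is a minimal Python sketch. It is not the snapwatchdog implementation (the plugins are written in C++ and rely on procps); the function names parse_meminfo, read_meminfo, load_averages and disk_usage are made up for this illustration, and the code assumes a Linux host where /proc/meminfo exists.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo style text into a dict of integers.

    Most fields are expressed in kB; the unit suffix is dropped.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the memory information of the local server."""
    with open(path) as f:
        return parse_meminfo(f.read())


def load_averages():
    """Return the 1, 5 and 15 minute load averages as a tuple."""
    return os.getloadavg()


def disk_usage(path):
    """Return (total, available) in bytes for the file system holding path."""
    st = os.statvfs(path)
    return st.f_blocks * st.f_frsize, st.f_bavail * st.f_frsize
```

Calling read_meminfo() gives access to fields such as MemTotal, MemFree, Buffers, Cached, SwapTotal and SwapFree, which correspond to the memory entries listed in the description. Calling disk_usage() once per mount point found in /proc/mounts would give the per-partition figures.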

For the network connectivity, this is done using the STATUS message from the snapcommunicator tool (see Inter-process Signalling [core]).

The following describes the various variables supported in the configuration file:

log_config=<path>

log_config defines the path to the logger (log4cplus) configuration file. By default this is set to /etc/snapwebsites/snapwatchdog.properties.

cassandra_host=<IP>
cassandra_port=<port>

The IP and port of a Cassandra node.

Once available, this will change to access a Cassandra driver instead.

statistics_period=<seconds>
statistics_frequency=<seconds>
statistics_ttl=<seconds>

The statistics variables define how often and how much data is gathered.

This will in part depend on your setup. With a smallish system (i.e. 6 servers), you can end up with 85Mb of data in one week (which is the default setup). So you may want to limit the setup to one or two days instead when running a really large cluster.

The period defines the amount of time the system is to keep your statistics data. By default it is set to one week (604800 seconds).

The frequency is how often the data is to be read. The minimum allowed is once every 60 seconds (once a minute.) The ticking makes use of our poll() and a snapcommunicator_timer which should be precise to about 100ms within a run.

The ttl parameter is used with Cassandra to define the TTL of the data saved in the cluster. It should be set equal to the period, but could be larger or smaller. Note that larger does not really make much difference because old data automatically gets overwritten with new data. Smaller generates a moving window.
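The relationship between the three statistics variables can be sketched as follows. This is only an illustration in Python, not code from snapwatchdog: the function name and the returned dictionary are hypothetical, raising a too-small frequency to the minimum (instead of rejecting it) is an assumption, and so is defaulting the ttl to the period when it is not specified.

```python
MIN_FREQUENCY = 60        # seconds, the minimum allowed as stated above
DEFAULT_PERIOD = 604800   # seconds, one week, the default stated above


def normalize_statistics(period=DEFAULT_PERIOD, frequency=MIN_FREQUENCY, ttl=None):
    """Return the effective statistics settings (all values in seconds)."""
    period = int(period)
    # Assumption: a frequency under the minimum is clamped, not refused.
    frequency = max(int(frequency), MIN_FREQUENCY)
    # Assumption: when no TTL is given, it follows the period.
    ttl = period if ttl is None else int(ttl)
    # A TTL larger than the period changes little since old data gets
    # overwritten; a smaller TTL produces a shorter moving window.
    window = min(period, ttl)
    return {
        "period": period,
        "frequency": frequency,
        "ttl": ttl,
        "window": window,
        "samples_in_window": window // frequency,
    }
```

For example, normalize_statistics(ttl=86400) describes a setup where the period is still one week but only the last day of samples remains visible in Cassandra.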

data_path=<path>

A path to a directory where the snapwatchdog daemon can write. This is generally defined under /var/lib/snapwebsites/snapwatchdog.

The watchdog will save all of the statistics gathered in that directory. The period defines the number of files that are to be saved there (i.e. by default the number of minutes in 1 week: 10,080 files, each about 2Kb with the current core plugins, so around 20Mb of data).
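The figures above come from a simple computation, reproduced here as a small Python sketch so you can estimate the local footprint for your own settings. The function name is hypothetical and the 2Kb per file is only the approximate size quoted above for the current core plugins.

```python
def local_statistics_footprint(period=604800, frequency=60, file_size_kb=2):
    """Return (number_of_files, approximate_size_in_kb) for one server."""
    files = period // frequency          # one file per reading
    return files, files * file_size_kb
```

With the defaults this returns 10080 files and 20160Kb, in line with the "around 20Mb" mentioned above. Limiting the period to one day (86400 seconds) brings this down to 1440 files.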

This is a copy of the data being saved in the Cassandra database.

If <path> does not exist or is not writable by the snapwatchdog daemon, then the snapwatchdog silently forfeits saving these files.

plugins_path=<path>:...

A list of colon separated paths to the plugins of the snapwatchdog daemon. By default this is /usr/lib/snapwebsites/watchdog_plugins (which is where the snapwatchdog package places the plugins.)

You may want to change this parameter in a development system to your BUILD/dist/lib directory. Make sure to use a full path to be sure that the right items get loaded.

plugins=<name>,...

A list of plugin names separated by commas. The default is to include all the plugins available in the core system. You may add your own or not include all the plugins to somewhat alleviate the amount of space used on the local disk and in your Cassandra nodes.

The default is: cpu,disk,memory,network,processes

Important Note: the plugins_default parameter is available but cannot be used. The snapwebsites library reacts to it by trying to load plugins from the sites table, which requires a valid URI; since there is no such URI in snapwatchdog, loading fails.
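As an illustration of how the plugins_path and plugins values are structured, the following Python sketch reads name=value lines and splits the two lists. It is not the parser used by the snapwebsites library: the function names are hypothetical, and treating lines starting with # as comments is an assumption about the file format. The defaults are the ones given above.

```python
DEFAULT_PLUGINS_PATH = "/usr/lib/snapwebsites/watchdog_plugins"
DEFAULT_PLUGINS = "cpu,disk,memory,network,processes"


def parse_config(text):
    """Parse name=value lines; blank lines and # comments are skipped."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: # starts a comment
            continue
        name, sep, value = line.partition("=")
        if sep:
            settings[name.strip()] = value.strip()
    return settings


def plugin_settings(settings):
    """Return (list of plugin directories, list of plugin names)."""
    paths = settings.get("plugins_path", DEFAULT_PLUGINS_PATH)
    names = settings.get("plugins", DEFAULT_PLUGINS)
    return (
        [p for p in paths.split(":") if p],                 # colon separated
        [n.strip() for n in names.split(",") if n.strip()],  # comma separated
    )
```

For example, a configuration that only sets plugins=cpu,memory keeps the default plugins path and loads two plugins instead of five.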

snapwatchdog_udp_signal=<IP>:<port>

The UDP address and port to be used by this Snap! Watchdog daemon. The default is 127.0.0.1:4099.

This information is used to open a UDP port waiting for a STOP command. That way you can cleanly stop the snapwatchdog daemon. Sending a signal may damage a file or the Cassandra database.
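Sending a command over UDP can be sketched in Python as shown below. The default address is the one given above. The exact format of the message expected by snapwatchdog is not described on this page, so passing a bare "STOP" string as the payload is an assumption, and send_udp_command is a hypothetical helper, not a tool shipped with Snap!.

```python
import socket


def send_udp_command(command, host="127.0.0.1", port=4099):
    """Send one UDP datagram holding the command; return the bytes sent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(command.encode("utf-8"), (host, port))
```

UDP is connectionless, so a call such as send_udp_command("STOP") succeeds whether or not a daemon is listening; check that the process actually exited afterward.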

snapcommunicator_listen=<IP>:<port>
snapcommunicator_signal=<IP>:<port>,...

The listen parameter defines the IP address and port to connect with the snapcommunicator via a TCP/IP connection. This connection is used to request STATUS messages from the snapcommunicator and thus know about the current status of the network.

The signal is a UDP/IP connection port that can be used to send a PING or similar message to any snapcommunicator or service registered with a snapcommunicator. At this time, this information is not used by the snapwatchdog.

watchdog_processes=<name>,...

Comma separated list of processes to watch.

The <name> parameter is actually a regular expression as supported by Qt. This value can also include a process standardized name if written before a colon as in:

cassandra:java.*apache-cassandra-[0-9]+.jar

This is used so we know that the regex is used to find Cassandra. Entries that have very simple regular expressions (i.e. "apache2") generally do not use a standardized name.
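The following Python sketch shows one way such a list could be interpreted. It is an illustration only: snapwatchdog uses Qt regular expressions whereas this uses the Python re module, the function names are hypothetical, and splitting on the first colon (and using the expression itself as the name when there is no colon) is an assumption. A regular expression that itself contains a comma would not survive the simple split used here.

```python
import re


def parse_watchdog_processes(value):
    """Turn "name:regex,regex,..." into a list of (name, compiled regex)."""
    entries = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, pattern = item.partition(":")
        if not sep:
            # no standardized name: the expression doubles as the name
            name, pattern = item, item
        entries.append((name, re.compile(pattern)))
    return entries


def running_processes(entries, command_lines):
    """Map each name to True when any command line matches its regex."""
    return {
        name: any(rx.search(cmd) for cmd in command_lines)
        for name, rx in entries
    }
```

The command_lines argument would typically be the list of command lines found in the task table (for instance the content of the /proc/<pid>/cmdline files on Linux).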

Other system of interest

Ubuntu offers a system called Juju that may be useful to manage your servers without having to log in to each server one by one using SSH or otherwise.
