Watchdog for a snap cluster [core]

Description

A standalone tool running in the background of all your computers in a Snap! cluster, automatically started by systemd as a service. This tool gathers data to report about the current and past health of each server. Contrary to a backend, this tool is expected to run on ALL servers so we can have a better idea of how they all run and especially be warned about problems such as high CPU usage or disk filling up.

The data gathered includes:

  • Report by Email (Done)
    • Requires presence of an MTA (reported in the snapmanager.cgi Watchdog tab)
  • Network connectivity
    • Verify that snapcommunicator is running (Done)
    • Connectivity of various expected services and computers through snapcommunicator
    • Connectivity of local servers not directly related to Snap! Websites (i.e. SSH, SMTP, etc.)
  • Memory usage (Done)
    • Total amount of memory available on server
    • Used memory
    • Free memory
    • Cached memory
    • Memory used for Buffers
    • Swap memory (total and still available)
    • Report overuse of RAM and Swap memory
  • CPU usage (Done)
    • 1 min., 5 min., and 15 min. load averages
    • Amount of time spent running user and kernel code
    • Amount of time spent idle
    • Number of pages that got swapped out (i.e. read-only text and data pages)
    • Number of pages of data that got swapped (i.e. read-write data saved to swap)
    • Time when the system was last booted
    • Total number of processes
    • Total number of running processes and of blocked processes
    • Report overuse of CPU (if 1 min. remains high for a while)
  • Disk usage (Done)
    • Total disk space available
    • Disk space left per partition
    • Report of disk filling up (starts around 90% full)
    • Verify mount points (not done)
  • Package Verifications (Done)
    • Report packages that are not installed when they are expected to be (required packages)
    • Report that certain packages that can cause problems are installed when they should not be
    • Report conflicting packages (package A and B cannot be installed together for the system to work properly)
  • Flags (Done)
    • Report when a flag was raised
    • Raise a flag when something is not working right and the system has no means of fixing it (see Raising a Flag below)
  • Firewall Integrity
    • Verify that snapfirewall is running (Done)
    • Verify that ports are closed as expected
    • Verify that only known neighbor IP addresses have opened connections
    • Verify that we can add and remove IP addresses to the firewall
    • Report problems discovered
  • Cassandra
    • Verify that Cassandra is running (Done)
    • Check Cassandra's statistics
    • Report problems discovered
  • Whether processes are running (Done)
    • Report processes that are not running when they are expected to be
      • apache2
      • ntpd
      • snaplock
      • snapmanagerdaemon
      • snapserver
      • snapdbproxy
      • Various snapbackend
      • snaplog
      • snaplistd
    • Report processes that are running when they should not be
      • snapbackend processes when disabled
  • Log
    • Report large log files (Done)
    • Report certain log entries (somewhat a la fail2ban)
  • Status of the Linux OS (data coming from snapmanagerdaemon)
    • Packages that are out of date (Done)
    • Whether a reboot of this server is expected
  • Scripts for your own purpose (Done)
    • For things that are not going to be used forever or when you can't recompile everything, you can extend the snapwatchdog with scripts that verify various parts of your system and allow you to fix it (i.e. test that a file is not installed, that a configuration does not include a certain field, etc.)

The data gathered by each snapwatchdog plugin is shared between servers using the Cassandra database. The network connectivity, however, is used to make sure that all the servers are always accessible and running (i.e. healthy). The watchdog tool can also run on other servers (as in servers that are not part of the Snap! cluster.)

Command Line

The snapwatchdog is a daemon. It is expected to be started by systemd and run permanently. Since it runs the plugins once a minute (unless it takes more than a minute to run all the plugins), systemd is set up to restart the daemon after 1 minute in case it crashes.

The command line options are listed when using --help (-h).

Use the --debug (-d) option to avoid having the daemon detach itself. This allows you to more easily debug in your console. However, if you want to debug the sub-processes in a debugger, make sure to compile the snapwebsites library in debug mode and with the SNAP_NO_FORK option. That will allow you to run all the children in the same process instead of a separate one. Note, however, that the plugins can only be initialized once so you can run things only once with the no-fork feature turned on.

The main other useful command line option is --config (-c) to define the path and filename of a configuration file other than the default (/etc/snapwebsites/snapwatchdog.conf).

Setup (/etc/snapwebsites/snapwatchdog.conf)

IMPORTANT: to change the setup, make sure to edit the following file. The main file should not be edited, so that updates from our environment make it to your system as expected.

/etc/snapwebsites/snapwebsites.d/snapwatchdog.conf

Technically, the tool is built with a set of "simple" plugins, each of which gathers the data as presented in the list above. The memory, CPU, and disk usage data is gathered using standard Linux APIs (/proc/meminfo, /proc/mounts, statvfs). The processes plugin checks the list of running processes (using the procps library) and uses a simple name or a regular expression to see whether certain processes are indeed present in the task table.
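As a rough illustration of the disk check, here is a minimal Python sketch that uses statvfs to compute how full a partition is and compares the result with the "around 90%" threshold mentioned in the list above. This is not the plugin's code (the plugins are part of the C++ daemon): the function names, the exact formula, and the exact threshold are assumptions made for the example.

```python
import os

# Threshold taken from the "starts around 90% full" note; the exact
# value used by the disk plugin is an assumption in this sketch.
FULL_THRESHOLD = 90.0


def percent_used(total_blocks, available_blocks):
    """Return the percentage of blocks in use (0.0 when the disk reports no blocks)."""
    if total_blocks == 0:
        return 0.0
    return 100.0 * (total_blocks - available_blocks) / total_blocks


def check_partition(mount_point):
    """Return a warning string if the partition looks too full, None otherwise."""
    st = os.statvfs(mount_point)
    # f_bavail is the number of blocks available to unprivileged users
    used = percent_used(st.f_blocks, st.f_bavail)
    if used >= FULL_THRESHOLD:
        return "%s is %.1f%% full" % (mount_point, used)
    return None


if __name__ == "__main__":
    warning = check_partition("/")
    if warning is not None:
        print(warning)
```

Virtual file systems often report zero blocks, which is why the sketch treats a total of 0 as "not full" instead of dividing by zero.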

For the network connectivity, this is done using the STATUS message from the snapcommunicator tool (see Inter-process Signalling [core]). That plugin also verifies that the snapcommunicator daemon is indeed running.

The following describes the various variables supported in the configuration file:

debug=on (or do not define)

Whether the debug features should be turned ON or not. This increases the number of debug messages. Note that the debug version of the binary may also generate even more useful debug output.

In most cases, you should leave this turned off unless you are a developer or have been asked by a developer to turn it on to help find a bug.

To get all possible logs in our log file and in the console, tweaking the log properties will do better than just the debug flag. The log properties accept a TRACE level and can enable printing all the logs in your console.

log_config=<path>

log_config defines the path to the logger (log4cplus) configuration file. By default this is set to /etc/snapwebsites/logger/snapwatchdog.properties.

statistics_period=<seconds>
statistics_frequency=<seconds>
statistics_ttl=<seconds>

The statistics variables define how often and how much data is gathered.

This should, in part, be tweaked depending on your cluster capabilities. With a smallish system (i.e. 6 servers), you can end up with 85Mb of data in one week (which is the default setup). So you may want to limit the setup to one or two days instead when running a really large cluster.

The period defines the amount of time the system is to keep your statistics data. By default, it is set up to one week (604800 seconds).

The frequency is how often the data is to be read. The minimum allowed is once every 60 seconds (once a minute.) The ticking makes use of our poll() and a snapcommunicator_timer which should be precise to about 100ms within a run.

The ttl parameter is used with Cassandra to define the TTL of the data saved in the cluster. It should be set equal to the period but could be larger or smaller. Note that larger does not really make much difference because old data automatically gets overwritten with new data. Smaller generates a moving window. Remember also that Cassandra can be slow at cleaning up such data (feel free to search about tombstones.)

Since version 1.6.2, it is possible to set the ttl parameter to "off" or 0 in order to prevent saving any data to the Cassandra cluster. Until we offer a snapwatchdog data review plugin in Snap!, it is actually not useful to save that data to Cassandra. Also, the tombstones can create slowness problems which means having the ability to turn that feature off is a good idea.

The default value is "use-period" which means that the current period value is used as the ttl value.
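The three possible kinds of ttl value described above can be summarized with a small Python sketch. The function name resolve_ttl is hypothetical and the daemon itself is written in C++, so this only mirrors the rules as documented here: "use-period" means the period, "off" or 0 means nothing is saved to Cassandra, anything else is a number of seconds.

```python
def resolve_ttl(statistics_ttl, statistics_period):
    """Return the TTL in seconds, or None when no data should be saved to Cassandra.

    statistics_ttl is the raw string from the configuration file and
    statistics_period is the period in seconds.
    """
    value = statistics_ttl.strip().lower()
    if value == "use-period":
        # default: reuse the current period as the TTL
        return statistics_period
    if value == "off" or int(value) == 0:
        # saving to Cassandra is turned off
        return None
    return int(value)


if __name__ == "__main__":
    for setting in ("use-period", "off", "0", "172800"):
        print(setting, "->", resolve_ttl(setting, 604800))
```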

plugins_path=<path>:...

A list of colon-separated paths to the plugins of the snapwatchdog daemon. By default, this is /usr/lib/snapwebsites/watchdog_plugins (which is where the snapwatchdog package places the plugins.) This should always be enough on a final working system.

You may want to change this parameter in a development system to your BUILD/snapwebsites/snapwatchdog/plugins directory. Make sure to use a full path so that the right items get loaded.

plugins=<name>,...

A list of plugin names separated by commas. The default is to include most of the plugins available in the core system. A few are turned off because the corresponding bundle may not be installed on such and such system.

You may add your own plugins or not include all the plugins to somewhat alleviate the amount of space used on the local disk and in your Cassandra nodes.

The default at the time of writing is: apt, cpu, disk, flags, log, memory, network, package, processes, watchscripts.

You can also have firewall and cassandra (more will come later, I'm sure, such as apache.) The processes being watched are defined in XML files installed under /usr/share/snapwebsites/snapwatchdog/processes. This allows various bundles to install a list of expected processes. Also, backends are a special case since they can be installed but remain disabled on various systems.

  • apt -- make sure that the system is up to date (Done)
  • cassandra -- make sure Cassandra is running (Done) and healthy
  • cpu -- keep track of CPU usage, if high for more than 3 minutes, generate reports (Done)
  • disk -- keep track of disk usage, if over 90% or so, start sending messages (Done)
  • firewall -- make sure snapfirewall is running (Done) and that the firewall is indeed up
  • flags -- when any part of the system detects certain problems, it can raise a flag, which is a file with a set of variables saved under /var/lib/snapwebsites/snapwatchdog/flags/... Flags represent an error that should be repaired as soon as possible (Done)
  • list (from the snapserver-core-plugins package) -- verify that the list plugin works as expected, including all the parts doing their job: list::listjournal backend running and emptying the cache folder, list::pagelist backend running, MySQL "journal" table changing continuously (Done)
  • log -- make sure log files do not grow too large (Done) and check for certain errors (a la fail2ban)—although in many cases this is similar to using a flag which can be faster than having to parse logs
  • memory -- keep track of memory usage, if over 90% of RAM is used or over 50% of swap, generate reports (Done)
  • network -- make sure that snapcommunicator is running (Done) and track connectivity
  • packages -- check that required packages are installed, that unwanted packages are not installed, and whether some conflicting packages are both installed (Done)
  • processes -- make sure certain processes (daemons) are running and generate reports if not (Done)
  • watchdogscripts -- allow for additional binaries (other than plugins) and scripts to be used to verify various things are healthy on your system (Done)

The watchdogscripts plugin is used in part to allow for checking things that can't be thought of before they happen. For example, we have had a problem with the fail2ban-client process which is used in a CRON process to clean up a few things. In most cases, this should run in under one second, but once in a while it gets stuck (probably when one of the files it's working on gets swiped out by logrotate while working on it.) I have not seen it happen lately so it could be that was a bug in python and not fail2ban. In any event, for such a case, having a script to check that the process is not taking 100% of the CPU is not a bad idea. Or even, detect that it is running for more than 5 min. and kill it automatically if that happens.

Important Note: although the plugins_default parameter is available in this configuration file, it cannot be used because with a default list the snapwebsites library reacts by trying to load plugins from the "sites" table (a table in your Cassandra cluster,) but without a valid URI since there is no such URI in snapwatchdog. Using it will break the startup of snapwatchdog.

log_path=<path>
log_subfolder=<path>

The log_path and log_subfolder are used by the watchscripts plugin. Each script may generate some output and that output is going to be saved in files under "$log_path/$log_subfolder".

The default is fully managed and in synchrony with the path in the snapwatchdog.properties file. If you want to change those paths, you want to edit the snapwatchdog.properties and corresponding logrotate file so the number of files and their size remains manageable.

You should change these only if you are a packager and need varying paths on the OS you are working with.

data_path=<path>

A path to a directory where the snapwatchdogserver daemon can write. By default it is set to /var/lib/snapwebsites/snapwatchdog.

The watchdog saves all of the statistics it gathers in sub-folders of that directory. For the basic data, the statistics_period and statistics_frequency define the number of files that are to be saved in here (i.e. by default the number of minutes in 1 week: 10,080 files, each is about 8Kb with the current core plugins; so around 100Mb of data with the directory metadata overhead.)

For most, this is a copy of the data being saved in the Cassandra database.
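The number of files quoted above follows directly from the period and the frequency. The following Python sketch redoes that arithmetic; the function name is made up for the example, the 8Kb per file is the estimate given above, and 1 Mb is taken as 1,000 Kb, which is an assumption.

```python
def estimate_data_files(period_seconds, frequency_seconds, file_size_kb):
    """Return (number of files, total size in Mb) for the data sub-folder."""
    files = period_seconds // frequency_seconds
    # assumption: 1 Mb = 1000 Kb, directory metadata overhead not included
    total_mb = files * file_size_kb / 1000.0
    return files, total_mb


if __name__ == "__main__":
    # defaults: one week of data, one sample a minute, about 8Kb per file
    files, total_mb = estimate_data_files(604800, 60, 8)
    print(files, "files, about", total_mb, "Mb before directory overhead")
```

With the defaults this gives 10,080 files and a little over 80Mb of raw data, which is consistent with the "around 100Mb" figure once the directory metadata is counted.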

The daemon tries to create the directories if they don't exist yet. If <path> still does not exist or is not writable by the snapwatchdogserver daemon, then the snapwatchdogserver silently forfeits saving the data files.

The regular statistics data (generated by the snapwatchdog plugins) are saved under the data sub-folder.

The RUSAGE statistics data (generated by dying processes and sent via snapcommunicator) are saved under the rusage sub-folder.

The flags that are raised by various parts of the Snap! software, or using the raise-flag command line tool, are saved in the flags sub-folder.

The watchdogscripts plugin offers a sub-folder named script-files where scripts are welcome to save data. For example, the test for the fail2ban-client tool uses a file there to know how long the tool has been running. This is how it can detect that the tool has been running for 3 minutes or more (without a cache of some form, it would not be possible to know how long from a single plugin run.) If temporary enough, the files can instead be saved under cache_path.
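To make the idea of such a state file more concrete, here is a minimal Python sketch of a "running for too long" test. It is not the actual fail2ban-client script: the function name, the file name, and the way the caller finds out whether the process is running are all assumptions. The only point is that the time of the first sighting is remembered in a file between two runs.

```python
import os
import tempfile
import time


def running_too_long(is_running, state_file, now, limit_seconds):
    """Remember when a process was first seen; return True once it exceeds the limit."""
    if not is_running:
        # the process is gone: forget the previous sighting
        if os.path.exists(state_file):
            os.remove(state_file)
        return False
    if not os.path.exists(state_file):
        # first run where the process is seen: remember when that was
        with open(state_file, "w") as f:
            f.write(str(int(now)))
        return False
    with open(state_file) as f:
        first_seen = int(f.read().strip())
    return now - first_seen >= limit_seconds


if __name__ == "__main__":
    # a throwaway directory stands in for the script-files sub-folder
    with tempfile.TemporaryDirectory() as folder:
        state = os.path.join(folder, "fail2ban-client.first-seen")
        start = time.time()
        print(running_too_long(True, state, start, 180))        # first sighting
        print(running_too_long(True, state, start + 240, 180))  # four minutes later
```

Note that this sketch does not tell two successive instances of the same process apart; a real script would probably also save the process identifier.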

cache_path=<path>

Some of the data used by snapwatchdog is temporary in nature. For example, it checks whether the CPU usage is high on the current system. If so, it will send a report about it. To accomplish this feat, though, the CPU plugin needs to remember that the CPU usage was high for a while. To do so, it saves a file in the cache_path folder.

The data statistics also have a counterpart file here named last_results.txt. That file is used to save the full path to the last data file that was saved. That way we do not have to list all the data files to know which one was last. That file also includes the number of errors found in the file (displayed in the snapmanager.cgi output in the Watchdog tab.)

watchdog_processes_path=<path>

The list of processes to check is defined in a set of XML files. Each file is loaded and converted to an array of process objects managed in memory.

The path in this parameter defines the location of the XML files.

The snapwatchdog comes with one such file named watchdog-processes.xml. Other projects can add their own list of processes to watch. For example, the server has a snapserver-processes.xml file used to define the snapserver process as one that needs to be watched. It also defines the snaplock process, which is a mandatory dependency of snapserver and is required when the server is running (i.e. by default the snaplock process by itself is not required to run.)

Important Note: The previous version of snapwatchdog used a watchdog_processes variable with a list of processes defined right there. This is not supported at all anymore. It was not scalable.

from_email=<email>
administrator_email=<email>

The email addresses used to send reports by email. The reports can also be viewed through snapmanager.cgi or directly by reading the XML files (which is difficult as they are saved in the compact format to save space.)

The from_email parameter holds the email that will be used in the "From: ..." field. If you are using Postfix, any email address with your domain name will work. So, for example, you could use no-reply@example.com if your domain name is example.com. If you have someone in charge of the direct administration of your cluster, it could be that person's email address so the recipients of the reports can reply to them directly. Note that we only accept one email address here. If you want to email multiple people, edit your Postfix setup and redirect this email to any number of users as required.

The administrator_email is the email used to send reports generated by the snapwatchdogserver daemon. In most cases, it will be emails that result from running the plugins and gathering statistics; however, there are a few other cases when you will get emails from the snapwatchdogserver environment. This email must be sent to some human. It's important if you really want to watch your cluster's health. Note that every single computer needs to have that parameter set up properly. This means using the Save Everywhere button when you edit this value in snapmanager.cgi is not a bad idea. As with the from_email, if you want to send these reports to multiple people, edit your Postfix settings and make sure that all the necessary administrators get a copy of the email.

watch_script_starter=<path to starter>
watchdog_watchscripts_path=<path>
watchdog_watchscripts_output=<path>

We expect all the binaries and scripts found under watchdog_watchscripts_path (default is /usr/share/snapwebsites/snapwatchdog/scripts) to be started by the watchdogscripts plugin with a default set of parameters defined before they start. Part is done internally and part is done by the starter script.

In other words, the watchdog scripts are not directly started from the plugin. Instead, the daemon runs:

$watch_script_starter "<name of script to run>"

This starter script (default is /usr/bin/watch_script_starter) includes a set of defaults as defined in /etc/default/snapwatchdog and then runs your script if it exists. If your script is not an executable, it starts it using the standard shell:

/bin/sh /path/to/your/script/script-name
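The decision made by the starter can be sketched in a few lines of Python. This only mirrors the behavior described here (run the script directly when it is executable, go through /bin/sh otherwise, do nothing if it does not exist); the function name is hypothetical and the real starter is not this code.

```python
import os


def starter_command(script_path):
    """Return the command used to run a watch script, or None if it does not exist."""
    if not os.path.isfile(script_path):
        return None
    if os.access(script_path, os.X_OK):
        # executable binaries and scripts are started directly
        return [script_path]
    # anything else is handed to the standard shell
    return ["/bin/sh", script_path]


if __name__ == "__main__":
    print(starter_command("/path/to/your/script/script-name"))
```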

The script is expected to print to stdout and/or stderr in case it discovers any problem. This is very similar to what CRON does (i.e. if you have a CRON script that prints out data, it gets emailed to the administrator or root if no other CRON email was defined.)

The output also gets saved under the watchdog_watchscripts_output directory. This way it is easily accessible by the administrator for review. The default is /var/lib/snapwebsites/snapwatchdog/script-files.

disk_ignore=<partition regex>,...

In some cases, a server may have a partition which looks like it is full because it is a form of virtual disk (i.e. /proc). snapwatchdog then ends up generating errors for such special partitions.

This parameter allows you to enter one or more regular expressions separated by a colon (:) that match the name of one or more partitions and avoid the corresponding warnings.

For example, on my computer snapd generates a few partitions which are 100% full. Those appear as /snap/core/<id>. To avoid them, I can write:

disk_ignore=^/snap/core/

(Note: This is actually an internal regex and it is not required in the disk_ignore=... parameter.)

Note that these are regular expressions but they do not require start and end delimiters. This allows for slashes to work within regex as a regular character. We use the QRegExp class so the regular expressions supported may not be as advanced as some others (like in the perl environment.)
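The following Python sketch shows how such a list of expressions could be applied to partition names. It uses Python's re module in place of QRegExp, so the supported syntax is not identical, and it assumes that a match anywhere in the name is enough (which is why the example anchors the pattern with ^). The function name and the separator parameter are made up for the example; the separator defaults to the colon described above.

```python
import re


def is_ignored(mount_point, disk_ignore, separator=":"):
    """Return True if mount_point matches one of the disk_ignore expressions."""
    patterns = [p for p in disk_ignore.split(separator) if p]
    # assumption: an expression may match anywhere in the partition name
    return any(re.search(p, mount_point) for p in patterns)


if __name__ == "__main__":
    setting = "^/snap/core/"
    for partition in ("/snap/core/4917", "/home"):
        print(partition, is_ignored(partition, setting))
```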

Raising Flags

The snapwatchdog is a tool used to detect errors at the time it runs. Unfortunately, some errors occur at any time in other parts of the software. In order to record those important errors, we offer a flag mechanism which generates a sticky error.

The way it works is very simple: raising a flag is equivalent to creating a file under:

/var/lib/snapwebsites/snapwatchdog/flags/...

The file includes a set of variable names and values (name=value) just like a configuration file does. The following is an example:

# This file was auto-generated by snap_config.cpp.
# Making additional modifications here are not likely
# be overwritten, assuming the tool handling this
# configuration file is not actively working on it.
count=3
date=1536169494
function=main
hostname=snap1
line=448
manual_down=no
message=could not find an MTA to send email with
modified=1536169494
name=mta-missing
priority=5
section=mail
source_file=/home/snapwebsites/snapcpp/snapwebsites/snapwatchdog/tools/raise_flag.cpp
tags=email,mailserver
unit=firewall
version=1.6.1.7
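Since a flag file is a plain list of name=value lines, it is easy to read from a script. Here is a minimal Python sketch, assuming that lines starting with # are comments and that everything after the first equal sign is the value. The function name is made up for the example and the sample is a shortened copy of the file shown above.

```python
def parse_flag_file(text):
    """Return the name=value pairs of a flag file as a dictionary of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            # skip empty lines and comments
            continue
        name, separator, value = line.partition("=")
        if separator:
            fields[name.strip()] = value.strip()
    return fields


sample = """\
# This file was auto-generated by snap_config.cpp.
count=3
date=1536169494
message=could not find an MTA to send email with
name=mta-missing
priority=5
section=mail
unit=firewall
"""

if __name__ == "__main__":
    flag = parse_flag_file(sample)
    print(flag["unit"], flag["section"], flag["name"], int(flag["count"]))
```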

To ease the creation of the file, we offer two interfaces. First, you can use a command line tool named raise-flag. This allows you to raise a flag, drop a flag (turn down), and list the currently existing flags (read as errors to be fixed.)

The other interface is for binaries. We offer a C preprocessor macro that creates a flag object where you can save the various parameters a flag is expected to include. Then call the save() function and voilà, the flag is created with all the parameters and the snapwatchdog flag plugin will pick it up as expected. Of course, the raise-flag command line tool uses that macro.

Parameters To Supply

The flag object implementation automatically adds fields to the flag files. What you have to supply are the following parameters:

manual_down=yes|no (optional, defaults to "no")

Whether the flag will automatically be taken down (manual_down=no) or whether the administrator will have to manually delete the file using the raise-flag command line as follows (manual_down=yes):

raise-flag --down firewall mail mta-missing

The three names (firewall, mail, mta-missing) are the same as defined in the flag file. These represent the origin and reason for the flag.

message=<text message> (mandatory)

The message explaining what's going on. This should be as verbose as possible to allow the administrator to fix the problem quickly. If too complicated for the flag itself, you are welcome to include a URI to a page with explanations.

unit=<name>

The part that generated the flag. In most cases this is the name of the package. In our example here it is set to firewall which is a package.

The <name> parameter is limited to A-Z, a-z, 0-9, and dashes. Also a name can't start with a digit or a dash ("0cat" or "-sun" are not valid.) The name can't end with a dash ("dog-" is wrong.) The name can't include two dashes in a row ("lit--bun" is wrong.)
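These restrictions can be expressed as a single regular expression. The Python sketch below is one way to write it; the function name is hypothetical and the actual check is done by the flag class, not by this code.

```python
import re

# a letter first, then letters and digits, then any number of
# "-<letters and digits>" groups: no leading digit or dash, no
# trailing dash, and no two dashes in a row
_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*")


def is_valid_name(name):
    """Return True if name follows the unit/section/name restrictions."""
    return _NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ("firewall", "mta-missing", "0cat", "-sun", "dog-", "lit--bun"):
        print(name, is_valid_name(name))
```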

section=<name> (mandatory)

The name of the section generating the error. In most cases, this is a sensible part of the unit which generated the error. If the flag is raised by a plugin, this would be the name of that plugin.

The section <name> parameter has the same restriction as the unit <name> parameter.

name=<name> (mandatory)

The name of the flag. In other words, this represents the name of the error this flag is referring to.

The name must be composed of letters, dashes and digits. In particular, it can't include spaces. To better describe the error, use the message=... parameter instead.

This name can easily be used by other tools to determine what this flag is about. However, it may not be unique. A fully unique name is defined by the unit, section, and name, all included (i.e. you could create a string such as "<unit>:<section>:<name>".) To make it unique to a computer, add the hostname to that name (i.e. "<hostname>:<unit>:<section>:<name>".)

The <name> parameter has the same restriction as the unit <name> parameter.
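Building such a unique string from the fields of a flag takes only a few lines. In this Python sketch, the function name is made up and the fields are expected in a dictionary of strings keyed by the names used in the flag file.

```python
def flag_key(fields, with_hostname=False):
    """Return "<unit>:<section>:<name>", optionally prefixed with the hostname."""
    parts = [fields["unit"], fields["section"], fields["name"]]
    if with_hostname:
        parts.insert(0, fields["hostname"])
    return ":".join(parts)


if __name__ == "__main__":
    fields = {
        "hostname": "snap1",
        "unit": "firewall",
        "section": "mail",
        "name": "mta-missing",
    }
    print(flag_key(fields))
    print(flag_key(fields, with_hostname=True))
```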

tags=<name>,... (optional)

It is possible to place the flag in various groups and in most cases you want to do so by specifying at least one tag, preferably 2 or 3.

For example, an error related to emails should include the "email" tag. Tags are separated by commas. The number of tags is not limited, although much more than 3 or 4 is probably not going to be useful.

priority=<number 1 to 100> (optional, defaults to 5)

When the snapwatchdog flag plugin picks up a flag, it reports it as an error and uses this priority when doing so. The default priority is 5, which is very low. Remember that any error with a priority under 50 is not emailed more than once a day. Priorities under 10 are never emailed on their own. So if you raise a very important flag, remember to set the priority to a number between 95 and 100. If important, but can wait a little, use 75 to 94. If important enough that it should be reported to the administrator, use a number between 50 and 74. If not too important, use a number between 10 and 49. For lesser flags, use a number from 1 to 9.
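The priority ranges above can be summarized as a small lookup. This Python sketch only restates the ranges given in this section; the function name and the wording of the returned strings are made up for the example.

```python
def describe_priority(priority):
    """Return a short description of the range a flag priority falls in."""
    if not 1 <= priority <= 100:
        raise ValueError("priority must be between 1 and 100")
    if priority >= 95:
        return "very important"
    if priority >= 75:
        return "important, but can wait a little"
    if priority >= 50:
        return "important enough to be reported to the administrator"
    if priority >= 10:
        return "not too important, emailed at most once a day"
    return "lesser flag, never emailed on its own"


if __name__ == "__main__":
    for priority in (5, 10, 50, 75, 95):
        print(priority, describe_priority(priority))
```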

Automatic Parameters

The other fields of the flag file are set automatically when you create a flag file.

Some of the parameters are generated by the C preprocessor macro and others are generated by the class implementation.

source_file=<path and name to source file>

Define the path and filename to the source file where the flag was raised. This is useful for developers to find the location and thus the exact reason why the code decided to raise the flag.

In the case of the raise-flag tool, this field shows the source file of the tool itself, so it won't help to determine which script called the raise-flag tool.

Note that the raise-flag command line tool allows you to change the source filename using the --source-file command line option. In most cases you will want to pass $0 as the parameter of this option or, optionally, only its basename:

raise-flag --source-file `basename $0` ...

Note that only passing the basename can be confusing if multiple instances of that script can be found on that computer.

function=<function name>

Define the name of the function raising the flag. The macro takes care of that. The raise-flag command line tool calls the macro from the main() function so this parameter will be set to main in that case.

Note that the raise-flag command line tool allows you to change the function name. Many shell scripts, though, do not use functions. In that case, you may want to consider using a placeholder such as the name of your script section where the flag is being raised.

bash gives you the stack of functions being processed. You can access the ${FUNCNAME[0]} variable as a result and that is the name of the current function. If you create a raise-flag function, then you can use ${FUNCNAME[1]} to get the name of the calling function.

line=<line number>

Define the line at which the call is made. The macro takes care of that. This can be very useful for a developer to find the exact location where the flag was raised, possibly speeding up a bug fix if such is required.

Note that the raise-flag command line tool allows you to change the line number information using the --line command line option. That way you can write scripts and have a valid reference to the line in that script where the flag is being raised from.

bash offers a $LINENO variable you can use with this command line option.
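If your script is written in Python instead of bash, the same information is available from the inspect module. The sketch below builds the --source-file and --line arguments from the location of the caller. The helper names are made up for the example, the command is only built and not executed, and the remaining raise-flag arguments are left out on purpose because they are not described here. The name of the calling function is also available (caller.function) but it is not used since the corresponding option is not named in this document.

```python
import inspect
import os


def location_args():
    """Return the raise-flag options describing where the caller is located."""
    caller = inspect.stack()[1]
    return [
        "--source-file", os.path.basename(caller.filename),
        "--line", str(caller.lineno),
    ]


def check_mta():
    # the other raise-flag arguments are intentionally not shown
    return ["raise-flag"] + location_args()


if __name__ == "__main__":
    print(check_mta())
```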

date=<creation date>

This is the Unix timestamp when the flag gets created.

You may re-raise the same flag multiple times, in which case this date does not get updated. This is why the raise-flag --list command shows this date as the creation date.

modified=<last modification date>

This is the Unix timestamp when the flag was last modified.

Contrary to the date field, this one gets updated each time the flag is raised.

count=<raised counter>

The number of times the flag was raised. If the field already exists, it gets incremented before it gets saved again.

hostname=<name of host where the flag was generated>

The name of the computer on which the flag was raised.

version=<snap version>

This defines the version of snap that was used to raise this flag. For example, it could be 1.7.3.84.

Another system of possible interest

Ubuntu offers a system called Juju that may be useful to manage your servers without having to log in to each server one by one using SSH or otherwise.
