Icinga2 distributed monitoring: a few things to check

A few things to check, or things that weren't obvious to me at first, when running a distributed master/satellite configuration.

Disclaimer: I started using icinga2 before Icinga Director came out. From what I understand, Director really helps simplify satellite configurations. However, I'm in the habit of using text config files, and haven't (yet) explored Director.

0- Why use satellite configurations

Those who used Nagios in the past are probably used to having all checks run from a central node, with some local checks (such as free disk space, memory, processes) run by a local NRPE agent that does passive checks. With icinga2, the icinga2 agent is installed on each node. It replaces NRPE in that it can run local checks, but it can also do two-way communication with the icinga2 master: the master can push commands and configurations to the satellite, and the satellite can send check results back to the master.
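On the satellite side, this two-way channel is handled by the api feature. A minimal sketch of the relevant settings (typically in /etc/icinga2/features-available/api.conf; the attribute names are from the stock ApiListener object):

```
object ApiListener "api" {
    // allow the master to execute commands on this node
    accept_commands = true
    // allow the master to push zone configurations to this node
    accept_config = true
}
```

Both attributes default to false, so a satellite that should receive configs from the master needs them enabled explicitly.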

This also helps with self-remediation: we can use icinga2's event_command to run local scripts when an incident is detected. For example, php-fpm is pretty reliable, but once every two months it might mysteriously crash on one of the 70+ servers I monitor. That's not often enough to justify investigating the issue further; maybe an upgrade will eventually fix it. Meanwhile, icinga2 can trigger an event_command to automatically restart php-fpm (sudo systemctl restart php7.2-fpm). It's also helpful for nudging Let's Encrypt https renewals (there is a known bug in one application I'm using), forcing Debian apt upgrades (unattended-upgrades is great, but configuring it for each and every third-party repo is a pain, and sometimes package maintainers flag packages as "needing manual upgrade" for no reason), etc. Sure, it's duct tape, and the underlying issues should be fixed, but there are only so many hours in a day and I already spend a ton of time reporting things to various upstreams.
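A sketch of how an event_command can be wired to a service (the command path and the php_fpm check are illustrations, not part of a stock install; the actual restart script appears later in this article):

```
object EventCommand "restart_php_fpm" {
    // the script itself must exist on the satellite running the check
    command = [ PluginDir + "/restart_php_fpm" ]
}

apply Service "php_fpm" {
    check_command = "php_fpm"
    // run the event command locally when the service changes state
    event_command = "restart_php_fpm"
    command_endpoint = host.name
    assign where host.vars.php_fpm
}
```

The event command runs on every state change (including recovery), so the script should check the $service.state$ argument before taking action.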

1- Keep all icinga2 configurations on the master node.

I used to deploy all my satellite configurations using Ansible. However, configurations can be stored in the /etc/icinga2/zones.d/global-templates directory on the master (create it if necessary). They automatically get deployed to satellites when the master node is reloaded (provided the zones are properly configured). This greatly simplifies things for me: I help manage the infrastructures of a few large organisations, each with their own Ansible playbooks, and having to edit each of those playbooks was tedious. That said, custom check scripts still need to be deployed with Ansible.
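For example, a CheckCommand definition kept on the master and synced to every satellite might look like this (the check_php_fpm script name is a hypothetical illustration; the script itself still has to be deployed separately):

```
// /etc/icinga2/zones.d/global-templates/commands/php_fpm.conf
object CheckCommand "php_fpm" {
    // PluginDir is a built-in constant, usually /usr/lib/nagios/plugins
    command = [ PluginDir + "/check_php_fpm" ]
}
```

After a reload on the master, the file ends up on satellites under /var/lib/icinga2/api/zones/global-templates/.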

2- Zone configurations

When I wanted to enable zone configurations by adding configuration files to /etc/icinga2/zones.d/global-templates, I was getting errors such as:

Error: Object 'php_fpm' of type 'CheckCommand' re-defined: in /etc/icinga2/zones.d/global-templates/commands/php_fpm.conf: 1:0-1:28;
previous definition: in /etc/icinga2/zones.d/global-templates/commands/php_fpm.conf: 1:0-1:28
Location: in /etc/icinga2/zones.d/global-templates/commands/php_fpm.conf:

So something was causing the files to get included twice, but where? The syntax check in debug mode can help:

 icinga2 daemon -C -x debug

It helped confirm that my config file was being included twice, but didn't say exactly where. From the debug output, however, it looked like the global templates were being included by a separate procedure, so I changed my icinga2.conf configuration:

include "zones.conf"
include "zones.d/*.conf"

# and not this:
# include_recursive "zones.d"

In the various examples I found online, I didn't see an include for zones.d/*.conf at all, but if I removed it, many of my checks stopped working. That may be because I have a lot of checks defined on the master that then run on the satellite (because I didn't know about global templates).

3- Debug tricks

Obviously, enabling debug logging on the satellite/node helps to find out more about an error. That said, I would often miss small details in the log because of the sheer amount of data. A simple grep helps:

tail -f /var/log/icinga2/debug.log | grep restart

Example:

notice/Checkable: Executing event handler 'restart_php_fpm' for checkable 'test-node.bidon.ca'
notice/Process: Running command '/usr/lib/nagios/plugins/restart_php_fpm' '-S' '' '-a' '1' '-s' 'CRITICAL' '-t' 'SOFT': PID 23042
notice/Process: PID 23042 ('/usr/lib/nagios/plugins/restart_php_fpm' '-S' '' '-a' '1' '-s' 'CRITICAL' '-t' 'SOFT') terminated with exit code 128
warning/PluginEventTask: Event command for object 'test-node.bidon.ca' (PID: 23042, arguments: '/usr/lib/nagios/plugins/restart_php_fpm' '-S' '' '-a' '1' '-s' 'CRITICAL' '-t' 'SOFT') terminated with exit code 128, output: execvpe(/usr/lib/nagios/plugins/restart_php_fpm) failed: No such file or directory

There were two errors in the above: 1) the actual script was not deployed correctly (no such file or directory), and 2) the -S argument had an empty value, so the script wouldn't have worked anyway.

Another useful trick is using the icinga2 object list command to make sure that the configurations have been received:

# icinga2 object list | grep fpm
Object 'restart_php_fpm' of type 'EventCommand':
  % declared in '/var/lib/icinga2/api/zones/global-templates/_etc/events/restart_php_fpm.conf', lines 1:0-1:36
  * __name = "restart_php_fpm"
    % = modified in '/var/lib/icinga2/api/zones/global-templates/_etc/events/restart_php_fpm.conf', lines 5:3-10:3
    * -S = "php$phpfpm_version$-fpm"
    % = modified in '/var/lib/icinga2/api/zones/global-templates/_etc/events/restart_php_fpm.conf', lines 3:3-3:64
  * name = "restart_php_fpm"
    * path = "/var/lib/icinga2/api/zones/global-templates/_etc/events/restart_php_fpm.conf"
  * templates = [ "restart_php_fpm", "plugin-event-command", "plugin-event-command" ]
    % = modified in '/var/lib/icinga2/api/zones/global-templates/_etc/events/restart_php_fpm.conf', lines 1:0-1:36

If the configurations are missing here, then they are not being deployed from the master server, which is most likely a configuration error in the icinga zones.

4- Satellite zone configurations

Since I had started with an early version of icinga2, many of my nodes didn't have the correct zone configuration to receive the global-templates. Here is a working example (from /etc/icinga2/zones.conf on a satellite):

object Endpoint "icinga.symbiotic.coop" {
    host = "icinga.symbiotic.coop"
    port = "5665"
}

object Zone "master" {
    endpoints = [ "icinga.symbiotic.coop" ]
}

object Zone "global-templates" {
    global = true
}

object Zone "director-global" {
    global = true
}

object Endpoint NodeName {
}

object Zone ZoneName {
    endpoints = [ NodeName ]
    parent = "master"
}
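For completeness, the master's own zones.conf needs a matching Endpoint and Zone for each satellite, or nothing gets deployed to it. A sketch, reusing the master hostname from the example above (the satellite name is hypothetical):

```
// /etc/icinga2/zones.conf on the master
object Endpoint "icinga.symbiotic.coop" {
}

object Endpoint "satellite1.example.org" {
    host = "satellite1.example.org"
}

object Zone "master" {
    endpoints = [ "icinga.symbiotic.coop" ]
}

object Zone "satellite1.example.org" {
    endpoints = [ "satellite1.example.org" ]
    parent = "master"
}

// global zones must be declared on both ends to be synced
object Zone "global-templates" {
    global = true
}
```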
