Check_MK Monitoring Part IV – Tuning Guide

In the first two parts of this series we gave you an overview of Check_MK and showed you how to install and navigate.  In Part III we went over the different check types.   In this post we’ll focus on information that’s hard to find on the ‘Net:  How to make Check_MK useful for you right away.

The goal of this page is to provide you with enough knowledge so that you can minimize monitoring noise / false-positive alerts, the dreaded out-of-the-box experience of just about any monitoring system before you’ve done the configuration and tuning work.

Check_MK is extremely customizable and my goal in this post is to present some of the most useful settings and rules, out of the many possible ones.  I also make a security suggestion and offer some initial checks to turn on that are going to be off by default.

CAVEAT EMPTOR  

These are the settings that worked for me after tuning my setup at a client site monitoring 500+ hosts and 20,000 services in one instance.  They’re based on my own experience and there’s nothing official about these best practices.  Hopefully after you’ve read this series you’ll have a solid starting point, though!

 

Global Settings

The below items are found under WATO -> Global Settings.

First, set a fallback address.

A fallback address is where all notifications go when not configured to go anywhere else, and it takes the form of an e-mail address.   Set this now.  Use a ‘less important’ e-mail group address.  Over time, your notification rules will grow complex as they evolve, and it’s good to know you’re not missing anything unexpected!

Increase the TCP Agent Timeout

We call Check_MK an ‘agentless’ monitoring system because there isn’t a resident daemon that periodically ‘phones home’ to the monitoring server, as there is with Nagios and Icinga2 setups.  With CMK, the monitoring server instead periodically connects to port 6556 on each host it’s monitoring and parses the ASCII output.  The problem with this is that hosts sometimes get very busy, and given that CMK reaches out to them every minute, there’s a non-zero chance that a busy host takes more than the default 5 seconds to answer the TCP connection.  Your mileage may vary, but in my environment 30 seconds was the sweet spot for lowering the number of false positives we received from some of the very busy servers (e.g., nginx and other edge servers).
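Before picking a number, it can help to time the agent pull by hand from the monitoring server and see how long a busy host really takes.  A quick sketch (the hostname is a placeholder, and the -w timeout behaves slightly differently between netcat flavors):

  # time how long it takes to pull the full agent output from a busy host
  time nc -w 60 busy-edge-host.example.com 6556 > /dev/null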

Custom Localizations

This is more of an aesthetic concern, but I found that the product’s ‘German English’ sometimes makes for strange translations.  One example is “productive systems” instead of “production systems.”  I have to be honest, I’ve seen many unproductive production systems!  To adjust vocabulary from one language to the next, the CMK developers kindly gave us the Custom Localizations configuration.  It’s not just the language – you can customize the names of many components, which can make things clearer for operations folks.

Automatic Disk Cleanup

This is off by default.  I think it’s a fantastic little maintenance feature that you might as well turn on.  Over time, depending on the number of services you have, those RRD files can add up.

Noise Reduction Settings

I’d say this section of the series will be the most impactful for your site.  The following settings live in WATO, under Monitoring Configuration.

A Word about Soft vs. Hard States

This speaks to the underlying Nagios core, and once I started tweaking it I had an ‘aha’ moment that helped me greatly reduce false-positive alerts in the environment.  In short: when a check result changes, the new state is at first only ‘soft’; it becomes ‘hard’, and notifications go out, only after the configured number of consecutive check attempts returns the same result.  Read the next sections to learn how to tune this de-coupling of ‘soft’ and ‘hard’ states.

After you’ve changed this setting in your environment, you can examine the result by clicking on “Host History” or “Service History” under the Views, after clicking on a hostname or a service name.  This will show you the soft and hard state changes that have occurred recently.  Screenshot coming.

I achieved better results with these settings than with the default ‘Flapping Detection’ (still useful in its own right, but with a separate purpose).

Now Update the Check Attempts

To change the soft and hard settings, under Monitoring Configuration click on Maximum number of check attempts per service.  Click on Create with the top-level folder selected.  I chose the number 4, which means it takes the monitoring server 4 consecutive ‘soft’ states to finally trigger a hard state.  On my setup, with one-minute checks, that means it might take 4 minutes for the monitoring server to start alerting.  Here’s what that looks like:

Now do the same for Maximum number of check attempts per host.  Again, go back to Monitoring Configuration and create a new rule for that one.

You can go back and increase these numbers (or decrease them) over time as you see fit.  Again, use the method previously described above (host and service history) to keep an eye on things before/after you do.

ANOTHER TIP:  I created yet another rule for Maximum number of check attempts per service and this time I chose the number 6 and under Services I entered Check_MK.  Since agent timeouts are *extremely likely* to occur in production, this is one more lever I pulled to lower noise in my environment.
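If your site runs on the Nagios core, you can sanity-check the result by dumping the generated configuration for a host and looking for the attempts value.  A quick sketch, run as the site user on the monitoring server (the hostname is a placeholder and the cmk flag is from the 1.2.x-era command line, so verify it with cmk --help on your version):

  cmk -N busy-edge-host.example.com | grep max_check_attempts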

Now how do I … ?

Add Services to Monitor

You installed the agent, right?  And if you telnet from the monitoring server to port 6556 of the host you’re monitoring, you get a bunch of ASCII output, right?  OK, then the next step is simply to go into WATO -> Hosts and do a re-inventory.  You can ‘activate missing’ or ‘tabula rasa’ – both will add new services.

NOTE:  adding new services doesn’t mean that metrics weren’t being gathered before, or that the agent wasn’t running, or that the monitoring server wasn’t triggering the agent.  Adding services just means they’re now part of the monitoring and can be used in notifications.
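If you prefer the command line over WATO, the re-inventory can also be done as the site user on the monitoring server.  A sketch based on the 1.2.x-era cmk flags (the hostname is a placeholder; verify the flags with cmk --help on your version):

  cmk -I myhost      # add newly discovered services ('activate missing')
  cmk -II myhost     # drop and re-discover everything ('tabula rasa')
  cmk -O             # regenerate the core config and reload so the change takes effect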

Send some alerts to some people, some of the time

Let’s say that I want to create a disk alert.  I look at the services in the host for the service names, like so. (As an aside, in the screenshot below the trending of disk usage is 0 because I just set it up, but it will actually attempt to make some predictions.  You can fine-tune this!)

And then I create a rule:  WATO -> Notifications -> New Rule.  It looks like this.  Notice the name of the service (which I obtained by looking at the host’s services and turning it into a regular expression).  In this example I customize the e-mail subject, the recipients and a few other things, but the sky’s almost the limit with the notification rules.  One useful item is Time Period – which you should create in the section of the same name under WATO.  It will then appear as a checkbox in the notification rule.

Pro Tip:  All the notifications and ‘rules logic’ are also written to a log file, so you can understand why an alert went out the way it did (via e-mail, or a plugin like PagerDuty, etc.).  It’ll lay out which rule(s) did and didn’t match, and the action being taken.  This is extremely useful when someone asks you whether an alert went out, or when.

 /opt/omd/sites/<SITE>/var/log/notify.log
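For example, to watch notifications as they happen, or to reconstruct what happened for one particular host (the grep pattern is just a placeholder):

  tail -f /opt/omd/sites/<SITE>/var/log/notify.log
  grep -i 'db-prod-01' /opt/omd/sites/<SITE>/var/log/notify.log | less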

Look up what any config items mean

You can turn on “Help” by clicking on the book icon in the upper right corner of the screen.  This turns on help text for many screens, and almost everyone misses it.  It’s one of the many UI quirks.  Sorry Mathias, a lit vs. not-lit book icon isn’t intuitive.

Group my hosts automatically

There are several steps before this becomes automatic.
Folders make it easy to apply rules that do things like selecting who gets notified or adjusting a threshold.  Folders also let you create rules in a hierarchical fashion, if you desire.  This will become more obvious once you create a new rule and are asked to select a folder from the hierarchy.
  1. Use the API, or manually add new hosts, within Folders named after the groups you want.  The API supports auto-creation of folders; see the example after this list.  More information on the API here (on vendor site).
  2. Go to Host Groups and create host groups, e.g. “HadoopDev”.   NOTE:  Host groups make it easier to group the views, among other things.
  3. Go to Host and Service Parameters and create a new rule – for the Assignment piece, select your folder, optionally any tags, save and activate.
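As a rough sketch of step 1, here is what adding a host into a folder over the Web API looked like on my sites.  The server, site, folder, host name, IP address and automation credentials below are all placeholders, and the exact parameters (including the folder auto-creation flag) vary by CMK version, so check the API documentation linked above:

  curl "http://monitoring.example.com/<SITE>/check_mk/webapi.py?action=add_host&_username=automation&_secret=MYSECRET" \
    --data-urlencode 'request={"hostname": "hadoop-dev-01", "folder": "hadoopdev", "attributes": {"ipaddress": "10.0.0.42"}}'
  # afterwards, activate the pending changes (in WATO, or via the API's activate_changes action)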

Monitor a remote datacenter

I’d like to write a separate post on this, but here is a summary.  Check_MK has the ability to scale out horizontally by talking to other monitoring servers, letting them do the checks and having one monitoring server be the view or portal.  More on that on the Check_MK site.

Adjust the threshold of one of the existing checks

It’s very similar to the “Update Check Attempts” example above, in the sense that it involves rules: each adjustment means creating a new rule.  The advantage is that, after you’ve grouped your hosts as described above, it’s very easy to apply settings adjustments to just a subset of hosts.

For example, to adjust CPU Utilization (or Load, which is a separate check), head to WATO -> Host and Service Parameters -> Parameters for discovered services.  There you’ll see many items that can be adjusted.  Many of these are for services you aren’t checking yet (and some require plugin installations).  In short, for our example, look for CPU utilization on Linux/UNIX.  Click on that link, pick the items you specifically want to adjust – this check, like many, is very configurable – then select which hosts it applies to, and apply it.
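To confirm which parameters a host actually ends up with after your rule, I found it handy to dump the host from the command line.  A sketch, assuming the 1.2.x-era cmk CLI and a placeholder hostname:

  # shows the host's tags, discovered services and the parameters each check will use
  cmk -D myhost | less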

Acknowledge an Alert so my colleagues don’t yell at me

I wish the CMK developers would improve the UI here.    Acknowledging an alert is not intuitive, but it’s very easy to do, luckily.

  1. Click on the name of the Service you want to ACK. The service name itself is actually a hyperlink.
  2. That takes you to the services view.  Click on the “Hammer” icon.
  3. Now in the resulting screen you MUST put a timeframe in before ACKing (e.g., a few days)
  4. You can check “send notification” if you want your colleagues to see your ack via e-mail.

(At some point you’ll want to create your own dashboard to view / not view Acknowledged alerts and even their comments.)

It’s Hammer Time. 
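If you ever need to acknowledge from a script instead of the GUI, the underlying Nagios external command can be pushed through Livestatus using the site’s lq helper.  A minimal sketch, assuming an OMD site and placeholder host, service, user and comment:

  # 2 = sticky, 1 = send notification, 1 = persistent comment
  echo "COMMAND [$(date +%s)] ACKNOWLEDGE_SVC_PROBLEM;myhost;Filesystem /;2;1;1;myuser;known issue, working on it" | lq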

 

Black out a host during downtime

  1. Click on the hostname of a host
  2. Click on the “Hammer” icon
  3. Click on a timeframe in the “From Now..” section.

You can schedule regular downtimes via Time Periods.  Create a Time Period in the Time Periods section in WATO, and then create a new rule under WATO -> Host and Service Parameters -> Monitoring Configuration -> Notification Period for services.

Install agents and plugins

The .rpm, .deb and .msi files for the *agent* are available under Monitoring Agents in WATO.  That page also has a bunch of plugins that sometimes need to be installed (and their man pages, also available at the command line by typing, as the monitoring user, cmk -m).  One example is the logwatch plugin, explained below.
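On the monitored host itself the install is the usual package dance.  A sketch (the file names are globs, and the plugin directory is the same one used later in this post):

  rpm -ivh check-mk-agent-*.rpm          # CentOS/RH
  dpkg -i check-mk-agent_*.deb           # Debian/Ubuntu
  # plugins are simply dropped into the agent's plugin directory and made executable
  install -m 755 mk_logwatch /usr/lib/check_mk_agent/plugins/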

You can also find a slew of more complex plugins at the Nagios Exchange here (I used a postgres one that is a 50K-line Perl script).

(Nagios plugins use the MRPE system, explained in Part III.)
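MRPE itself boils down to a small config file on the monitored host: a service name (no spaces) followed by the Nagios plugin command line to run.  A minimal sketch, with a made-up service name and a typical plugin path:

  # /etc/check_mk/mrpe.cfg
  Local_Web /usr/lib/nagios/plugins/check_http -H localhost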

Monitor a logfile

1. mk_logwatch (available on the monitoring server on the ‘Agents’ page in WATO).  That goes in /usr/lib/check_mk_agent/plugins and must be executable.

2. The logwatch.cfg configuration file (read by the plugin from /etc/check_mk/logwatch.cfg) should contain the name of the file to monitor and, on the lines beneath it, C, W or O (CRIT, WARN, or OK) plus the regular expression for that level.  Also, if you are following my ‘Security Considerations’ below, /usr/lib/check_mk_agent/plugins should be owned by checkmk or whatever user you’re running the agent as.  For example:

/var/log/elasticsearch/my-thing-prod.log
 C .*my funny [Rr]egex here.*

3.  In the GUI, WATO -> Logfile Pattern Analyzer, insert the hostname, the logfile, and the pattern match.

Now re-inventory the host to test it out.  Then test the new alert by looking under Notifications.
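A quick way to confirm the plugin is wired up, before the monitoring server even gets involved, is to run the agent locally on the monitored host and look for the logwatch section:

  check_mk_agent | grep -A 3 '<<<logwatch>>>'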

 

Hook into a directory service (LDAP and so forth)

That’s under Users -> LDAP Connections

Security Consideration

Agent runs as root (Linux)

The ‘agent’ listens on tcp/6556 by default.  On CentOS/RH 6.x and earlier it leverages xinetd, which multiplexes the connections and passes them to a shell script, /usr/bin/check_mk_agent, which simply runs a slew of other commands to retrieve system information.  (On CentOS/RH 7.x, instead of xinetd, that’s /etc/systemd/system/check_mk@.service.)  The script returns simple ASCII output, formatted in a way that the monitoring server can parse to separate out the services and the data returned.  (You can test that by simply telnetting locally to 6556 and looking at the output.)

This shell script also looks for executable files in various plugin and custom check directories (we discuss custom checks in Part III of the series, by the way).  The issue is that the xinetd config (/etc/xinetd.d/check_mk) has this script run as root.

This is a problem.  First of all, while it’s true that some of the commands being run by the script need root, it’s only a few of them (like parsing /var/log/messages), and those can easily be wrapped with sudo.  Next, and more pressingly, any script placed in one of the plugin directories (/usr/lib/check_mk_agent/{plugins,local}) or run by MRPE will blindly run as root.  That’s an unnecessary and gaping security hole if you ask me, and relatively easy to remedy.  Aside from creating a local user like ‘checkmk’ and changing the config files above, you need to give ‘checkmk’ passwordless sudo in the sudoers file, and edit the /usr/bin/check_mk_agent script to stick sudo in front of the for loop that runs the local checks and the plugins (near the end of the file), as well as items like the log reader.  I’ll include the details here at some point soon.
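To give a flavour of what that looks like, here is a minimal sketch of the two config changes, assuming the ‘checkmk’ user from above and an xinetd-based host; the exact commands you whitelist in sudoers depend on which plugins and checks you actually run:

  # /etc/xinetd.d/check_mk -- run the agent as an unprivileged user instead of root
  #     user = root
        user = checkmk

  # /etc/sudoers.d/checkmk -- grant root only for the handful of commands that need it
  checkmk ALL=(root) NOPASSWD: /bin/cat /var/log/messages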

Anything else?

Yes.  In brief:

Here are a few useful Active Checks to turn on right now, with rules you can generate straight from the Active Checks screen in WATO.

  1. Turn on SSH active checks.  If you’ve read this far then you can surmise how to do that:  WATO -> Host and Service Parameters -> Active Checks -> Check SSH Service -> Create New Rule.  Apply that to all your hosts and tick the “Timeout” checkbox.  Here’s why:  I’m suggesting you ‘tune down’ the Check_MK agent alerts, which are too sensitive to system load and lead to false positives.  And I’m also suggesting that you turn on Ping checks.  The issue there is that a host can be hosed and still respond to pings.  SSH, though, is a bit less likely to answer on a hosed host.
  2. Make a new rule for Check Hosts with PING.  Same as above, under Host and Service Parameters.  This check has useful latency metrics you’ll get to keep long-term, and perhaps set an alert for at some point.

Sign up for the Check_MK Mailing List.  There are a few of them.  There’s one for community support that I’ve found quite useful.  There’s an archive there too.

Curious about how to size your monitoring server?  My experience is that I/O could potentially be your biggest bottleneck, due to all the RRD files that are constantly being written to, but overall I didn’t find I needed a heavy server.  See what Mathias has to say on the main site here.

Next Steps

SNMP Traps add a new dimension to your monitoring, since you can act on alerts that can be sent at any time by automated processes or applications.  Check out Part V of the series here to learn more.

 

Upcoming posts:

  • How to create custom dashboards
  • How I made this site with WordPress statically.

Stay tuned!