Check_MK Monitoring Part II – Install and Explore

In Part I we gave you a high-level overview of what Check_MK is capable of.  Here we’ll show you a quick installation example on Linux (CentOS 7 here) and explain some of the terminology in use by Check_MK and the rest of this series.

If you want to skip these details and jump straight to understanding the different check types, go to Part III of this series.  After you’ve traveled there, you should explore Part IV to begin tuning the system.

Getting started

Quickstart Installation

Head over to the vendor’s download page and grab the latest stable version of the Raw edition.  The Raw edition is the open source edition.  They offer an enterprise version which confers additional advantages, such as support, a faster engine (a replacement for Nagios 3.5, which currently serves as what they call the “monitoring core”), HA redundancy and more.  There’s even an appliance version.  We don’t concern ourselves with any of that here, though (yet), so Raw it is.

OMD

The rpm itself is an OMD package.  OMD is just the form in which the package comes;  it’s quite clever.  It unpacks a self-contained directory structure within /opt/omd that holds all the directories relevant to the product (etc, usr/bin and so on).

You then run the omd command as root to create the site, which also creates a local user and directory structure, likewise contained within /opt/omd.  If the local user runs omd, it can only stop and start its own site.  You can have more than one Check_MK site on the same server this way, which comes in handy for testing.

Rather than hold your hand through the process, I’ll let the screenshot above make clear the few commands that go into creating a new site.
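In case the screenshot doesn’t come through, here is a rough sketch of the same sequence.  It only builds and prints the commands rather than running them, since they need root on a box with the OMD package installed.  The site name mysite is a placeholder, and the helper omd_site_commands is made up for this illustration.

```python
import shlex


def omd_site_commands(site):
    """Return the omd commands (to be run as root) that bring up a new site.

    'site' is whatever name you choose; "mysite" below is only a placeholder.
    """
    return [
        ["omd", "create", site],   # creates the site, its local user and directories
        ["omd", "start", site],    # starts the site's services
        ["omd", "status", site],   # shows whether everything is running
    ]


# Print the commands so they can be reviewed before being run by hand.
for cmd in omd_site_commands("mysite"):
    print(shlex.join(cmd))
```

Once the site exists, you can switch to the site user and run omd start or omd stop from there, which only affects that one site.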

As a side note, I much prefer this over the Red Hat mentality, which is to allow packages to litter the whole filesystem.  I can also install multiple sites side-by-side, owned by different users:  OMD automates the Apache configs for me.  But there’s more:  I can also install multiple versions of Check_MK side-by-side without them stepping on each other’s toes.  (Debian and Red Hat could learn a thing or two, “Alternatives” notwithstanding.)

OMD has the right idea.

Some Terminology and Concepts

Let’s briefly go over some basic concepts of the system.

WATO

The UI module on the web interface that lets you customize your monitoring without editing configuration files by hand.  Screenshots below.

Services

Each “item to check” appears as a service.  A check for /var, a custom HTTP link, an SSH login checker, a ping check for a single host, and so forth:  each one of these is a single service.  You can expect a given host to have 30-40 services out of the box.

Active / Passive Checks

This seems like more of a Nagios-derived set of concepts, but in short, an active check is one that comes from the monitoring server (for example, a simple ping check), and a passive check is one that is done on the client side (like a check that a process is running).  In practice, with Check_MK it’s a bit grey at times:  the monitoring server triggers, via a single active check, multiple passive checks and reports back the results.  More on this in the next section.

Note:  out of the box, the default host checks run once per minute.

Agent

This is interesting.  On Linux, there is no daemon process that runs like with Nagios or Icinga.  It’s more clever than that.  The agent is a simple shell script, and this shell script does a bunch of checks (df’s, network checks via catting /proc files, checking the output of mdstat or mount, that sort of thing).  The script prints its results in a plain ASCII format that the monitoring server parses and automatically stores in RRD files, from which the graphs are generated.
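To get a feel for what “parsing the ASCII output” means, here is a minimal sketch.  The agent groups its output under section headers of the form <<<name>>>, and the function below splits a blob of text on those headers.  The sample text is made up for illustration (real output is much longer), and split_agent_sections is a hypothetical helper, not how the server actually does it.

```python
def split_agent_sections(raw):
    """Split raw agent output into {section_name: [lines]}.

    Lines before the first header are ignored.  Anything after a colon
    in a header (header options) is dropped from the section name.
    """
    sections = {}
    current = None
    for line in raw.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            current = line[3:-3].split(":", 1)[0]
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


# Made-up sample in the general shape of agent output.
SAMPLE = """<<<check_mk>>>
Version: 1.2.8
AgentOS: linux
<<<df>>>
/dev/sda1 ext4 10190100 2536920 7112508 27% /
<<<uptime>>>
86400.00 170000.00
"""

for name, lines in split_agent_sections(SAMPLE).items():
    print(name, len(lines))
```

Each section then feeds one or more services on the monitoring side, which is why a single fetch can update dozens of checks at once.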

We say that Check_MK is agentless due to the lack of a running daemon.  That’s good, right?

On CentOS 6.X and prior, the shell script runs via xinetd, which listens on (by default) TCP/6556 and launches the script for each incoming connection.  By default this runs as root;  see the next instalment on how to lock down the system.
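Because the agent just writes its output to whoever connects and then closes the connection, you can look at the raw data yourself.  The sketch below does that with a plain TCP socket.  The function name fetch_agent_output and the timeout value are my own choices, and it assumes the default port with no access restrictions in the way.

```python
import socket


def fetch_agent_output(host, port=6556, timeout=5.0):
    """Connect to the agent port and return everything it sends as text.

    Reads until the remote side closes the connection, which is how the
    end of the agent output is signalled.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:  # empty read means the peer closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Example (needs a reachable host running the agent):
#   print(fetch_agent_output("myhost.example.com"))
```

If this returns nothing or times out, check that xinetd is running and that a firewall isn’t blocking the port.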

Notifications

In other words, “alerts”.  That’s obvious, but what’s less obvious is that they’re governed by logic rules which tie together host groupings, tags, people, frequency, notification level (warn, crit, ok), and much, much more.

Rules

Rules are how nearly everything in the system is configured, and they are what gives it so much flexibility.  Whenever you go to edit the service groups, the host groups, the notifications or even the thresholds, you’re going to be taken to a create rule wizard.
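As a mental model only, you can think of a ruleset as an ordered list of “if the host has these tags, use this value” entries.  The sketch below is a deliberately simplified toy, not Check_MK’s rule engine:  the Rule class, the tag names and the threshold values are all invented, and it assumes the first matching rule wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    required_tags: frozenset  # host must carry all of these tags to match
    value: object             # whatever the rule sets, e.g. (warn, crit) levels


def first_match(rules, host_tags, default=None):
    """Return the value of the first rule whose tags are all present on the host."""
    for rule in rules:
        if rule.required_tags <= host_tags:
            return rule.value
    return default


# Hypothetical filesystem thresholds as (warn %, crit %), most specific rule first.
rules = [
    Rule(frozenset({"prod", "db"}), (90.0, 95.0)),
    Rule(frozenset({"prod"}), (85.0, 92.0)),
    Rule(frozenset(), (80.0, 90.0)),  # no conditions: matches every host
]

print(first_match(rules, {"prod", "db", "linux"}))
print(first_match(rules, {"test", "linux"}))
```

The point of the ordering is that you can put narrow exceptions on top and a catch-all at the bottom, rather than editing per-host configuration.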

Soft vs Hard notifications

A slightly more advanced topic.  The system has a concept of soft versus hard alerts, which gives you the ability to customize it so that a certain number of softs have to happen within a timeframe to trigger a hard.  This is essential when you start reducing the alerting noise on the system.
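Here is a small sketch of the idea, following the Nagios-style model where a problem stays “soft” until it has been seen on a set number of consecutive checks.  The function name label_results and the default of three attempts are assumptions for the example;  the real limit is whatever you configure.

```python
def label_results(results, max_attempts=3):
    """Label each check result as "OK", "SOFT" or "HARD".

    A non-OK result is SOFT until it has occurred max_attempts times in a
    row, at which point it (and any further failures) is HARD.  Any OK
    result resets the counter.
    """
    labels = []
    failures = 0
    for result in results:
        if result == "OK":
            failures = 0
            labels.append("OK")
        else:
            failures += 1
            labels.append("HARD" if failures >= max_attempts else "SOFT")
    return labels


history = ["OK", "CRIT", "OK", "CRIT", "CRIT", "CRIT", "CRIT"]
for result, label in zip(history, label_results(history)):
    print(result, label)
```

In this model the one-off CRIT in second position never gets past soft, so a notification that only fires on hard states would skip that blip entirely.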

Logging in and …

Now that you have a site up (login is omdadmin / omd by the way), you have a dizzying UI.  Here are some foundational bits of information you need to chew on.

Views 

A whole section that’s mostly (but not entirely) “read-only”, which you’ll be looking at once you have your system configured.  Overview is probably your first-stop-shop, but at some point you’ll create a custom dashboard to make your colleagues (or yourself) happy.  Hosts, Host Groups, Services, Service Groups and so forth are a way to slice and dice elements of what you’re monitoring.  We’ll get into this later.  Event Console is a separate topic and where the SNMP stuff becomes interesting.

We’ll go over “dashboards” in a later article.

WATO – Configuration Overview

This is the “administrator-level” component of the system.

Hosts is where you’ll add the hosts and devices to monitor.  It’s also possible to use an API to do it.  Host Tags come in handy later:  they’re arbitrary tags (some built-in) that let you classify your hosts for future use in your rulesets.
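For the API route, the sketch below only builds a request and does not send anything.  It assumes the WATO Web-API style of this era, where an automation user calls webapi.py with an add_host action and a JSON “request” body.  The endpoint path, parameter names, user name, secret and host details shown are all assumptions or placeholders, so check them against the documentation for your version before relying on them.

```python
import json
from urllib.parse import urlencode


def build_add_host_request(base_url, username, secret, hostname,
                           folder="", attributes=None):
    """Return (url, body) for an add_host call.  Nothing is sent.

    base_url is the site's check_mk URL; all parameter names here are
    assumptions about the Web-API and should be verified for your version.
    """
    query = urlencode({
        "action": "add_host",
        "_username": username,
        "_secret": secret,
    })
    payload = {
        "hostname": hostname,
        "folder": folder,
        "attributes": attributes or {},
    }
    body = urlencode({"request": json.dumps(payload)})
    return f"{base_url}/webapi.py?{query}", body


# Placeholder values throughout.
url, body = build_add_host_request(
    "http://monitor.example.com/mysite/check_mk",
    "automation", "changeme", "web01",
    attributes={"ipaddress": "192.0.2.10"},
)
print(url)
print(body)
```

You would POST the body to the URL with whatever HTTP client you like.  As with changes made in the web interface, expect to have to activate them afterwards.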

Global Settings are meta-level configuration settings.  We go over a few of these later in this series.

Host and Service Parameters and Manual Checks are how you tune your existing checks, add new checks, and even configure meta-level stuff like renaming terms in the system, controlling check and alert frequencies, and much more.  You’ll also go here to group your hosts and services, also via rules.

Users, Contact Groups and Roles and Permissions are where you can hook into a directory service (hidden under Users->LDAP Connections instead of Global Settings where it should be) and create administrative levels for users and even user groups.

Notifications are where you’ll create rules to send emails, page via Pagerduty, or whatnot.

Business Intelligence we won’t address here, but you can use it to aggregate services and views of them, for example a grid of hosts for which you want a certain service level.

Backup & Restore is great, but what’s even better is that the system takes snapshots every time you edit it, in case you (or someone else;  there’s an audit log!) screw something up.  Mwwah!

Distributed Monitoring is how you set up more than one Check_MK server.  For example, if you deploy one in another region or network, you can still ‘view’ all your hosts and services in one Check_MK instance.

Event Console  we don’t get into here, but while it has wider uses, I used it when I set up an SNMP-trap-receiving system.

Next Steps

We go over the check types in much more detail in Part III of this series.  In subsequent parts we’ll talk about the very important topic of how to actually tune your monitoring so you don’t drown in noise and false positive alerts!