Check_MK Monitoring Part I – Introduction

You may have come here because you’re looking to set up a monitoring system. Monitoring systems tend to be viewed as inflexible, hard to learn and time-consuming to customize. That doesn’t have to be the case.

NOTE: If you’ve already used Check_MK or just want to jump straight to the technical aspects, head over to Part II here.

In this series we’ll explore one open source solution (with enterprise options) that you can, with a little guidance, set up to fully monitor your IT infrastructure. You’ll have alerting on system-level, application and hardware components, be able to quickly spin up new custom alerts you create yourself, keep custom metrics for the long term, monitor remote sites and more, all with a 60-second alert resolution.

Never heard of Check_MK (CMK), or tried it in the past and found it difficult to work with? Worried that it’s only for “systems” or “networks” when you’re more interested in application monitoring? Still, read on.

Be wise: customize

Application and systems environments are often unique (and not-so-beautiful) snowflakes.

One of the reasons monitoring is such a pain is that each system has an unnecessarily steep learning curve to climb before you feel comfortable customizing it.

All that time you spent learning Netcool or Nagios or (other system you inherited) doesn’t translate to instant comfort the first time you start looking at Zenoss, or Sensu or …

But most importantly, unlike many out-of-the-box software experiences, enterprise monitoring requires extensive customization. Left untuned, it becomes a Pandora’s box of false positives, mis-tuned alerts, missing metrics or, worse, a complex system people are scared to change.

Why I chose Check_MK

... and not barebones Nagios or Icinga2?

(Image: a customized dashboard specific to the team I was working with.)

Check_MK is an automation system that offers numerous enhancements on top of a barebones Nagios (or Icinga2) installation:

  • A GUI plug-in system (called WATO), so you needn’t worry about hand-editing the configuration files. The configuration files themselves are already easier to comprehend because they abstract away the otherwise abstruse Nagios .cfg files. They take the form of Pythonic .mk files.
  • Instant graphs. My favorite: the built-in graphs of metrics. For every metric I gather, the system automatically produces a graph with a 5-minute granularity (based on 1-minute datapoints unless I say otherwise).
  • This includes custom checks. So long as the output of your monitoring script is correct, Check_MK will automatically store the datapoints in an RRD file behind the scenes and present a graph in the Views.
  • If you want to receive asynchronous SNMP traps, you can configure the system to do that.  You can then take advantage of the complex rules engine to parse the traps, and finally you configure the notification rules of Check_MK to do something about it.  With a little customization you can do just about anything you need to do.
  • No database. Sorry, but as a sysadmin I hate having to worry about shaving, trimming and washing my database, and changing its diapers (dumps, get it?). Check_MK stores metrics in RRD files, and the host and rules configuration in Pythonic .mk files. There’s some other state kept in cache and other files on disk. Do you really want to maintain another Postgres/Redis/MySQL/blah database?
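
To make “the output of your monitoring script is correct” a little more concrete, here is a minimal sketch of a script that prints one line in the local-check style as I understand it: a numeric status, a service name without spaces, the performance data, and free text. The service name, metric name and thresholds below are hypothetical, chosen only for illustration; Part III covers the real details.

```python
def format_perfdata(metrics):
    """Render {name: (value, warn, crit)} as name=value;warn;crit joined by '|'."""
    parts = [f"{name}={value};{warn};{crit}"
             for name, (value, warn, crit) in metrics.items()]
    # A single dash stands for "no performance data".
    return "|".join(parts) if parts else "-"


def local_check_line(status, service, metrics, summary):
    """Build one output line: <status> <service> <perfdata> <text>."""
    if status not in (0, 1, 2, 3):  # OK, WARN, CRIT, UNKNOWN
        raise ValueError("status must be 0, 1, 2 or 3")
    if " " in service:
        raise ValueError("service name must not contain spaces")
    return f"{status} {service} {format_perfdata(metrics)} {summary}"


if __name__ == "__main__":
    # Hypothetical metric: 42 queued jobs, warn at 50, crit at 100.
    print(local_check_line(0, "App_Queue_Depth",
                           {"queue_depth": (42, 50, 100)},
                           "42 jobs waiting"))
```

Each named metric in the performance data is what ends up as a graph, which is why getting that one line right matters.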

(Image: a custom dashboard for a distributed database product called Vertica, with some checks I wrote for it.)

Types of checks

Even the open source version of Check_MK comes with hundreds of built-in, pluggable checks that you can use to just gather metrics, or to wake you up in the middle of the night. Hopefully the former.

The agent and passive checks: out of the box, I have a slew of powerful checks. The “agent” (a simple shell script) alone gives me several dozen metrics, on which I can of course create alerts, such as:

  • CPU (load and utilization percentage, tracked separately)
  • Memory
  • Disk (usage but also IO, including trend predictions!)
  • Network utilization down to the interface
  • TCP metrics (including a breakdown)
  • and much, much more
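
As a rough illustration of what a disk “trend prediction” means, the sketch below extrapolates a straight line through the oldest and newest usage samples to estimate when a filesystem fills up. This is not Check_MK’s actual algorithm, only a simplified assumption of the idea; the function name and sample numbers are hypothetical.

```python
def hours_until_full(samples, capacity_gb):
    """Estimate hours until the disk is full.

    samples: list of (hours_since_start, used_gb) tuples, oldest first.
    Returns None when usage is flat or shrinking.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time range")
    growth_per_hour = (used1 - used0) / (t1 - t0)
    if growth_per_hour <= 0:
        return None  # no growth, so no predicted fill time
    return (capacity_gb - used1) / growth_per_hour


if __name__ == "__main__":
    # Hypothetical: 400 GB used a day ago, 412 GB now, on a 500 GB disk.
    print(hours_until_full([(0, 400.0), (24, 412.0)], 500.0))
```

An alert on the predicted time-to-full warns you days before a plain “disk is 90% full” threshold would.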

The server itself can perform checks, called Active Checks, and out of the box I have a slew of built-in ones.  A tiny sample:

  • HTTP metrics (if I enable them; I can also enable a breakdown, more on this in another post)
  • Custom SQL checks
  • Logfile parsers

I can also very quickly create a custom check (something I explain in Part III) that automatically gathers metrics, using the scripting or programming language of my choice. It’s trivial to accomplish, and it pleases your boss or clients when they see graphs being kept that you can now use for capacity management, trending and so forth.

Finally – all the Nagios plugins (from the Nagios Exchange) are compatible with Check_MK, so rather than write your own Postgres plugin, just download someone else’s.
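
For readers who have never looked inside a Nagios plugin, the convention is simple: the exit code carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and the single output line carries human-readable text, optionally followed by a pipe and performance data. Below is a minimal sketch of that convention; the metric name, value and thresholds are hypothetical, and a real plugin would measure something rather than use a hard-coded number.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING",
               CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measurement to a plugin state using upper thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def plugin_result(label, value, warn, crit):
    """Return (exit_code, output_line) in the usual plugin layout."""
    state = evaluate(value, warn, crit)
    text = f"{STATE_NAMES[state]} - {label} is {value}"
    perfdata = f"{label}={value};{warn};{crit}"
    return state, f"{text} | {perfdata}"


def main():
    # Hypothetical measurement: 87 connections, warn at 80, crit at 95.
    state, line = plugin_result("connections", 87, 80, 95)
    print(line)
    return state

# A real plugin would finish with: import sys; sys.exit(main())
```

Because the contract is just “one line of output plus an exit code”, plugins written in any language can be reused unchanged.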

(Image: some of the checks that come out of the box, just to give you a sense of the scale.)

Next Steps

Check the “Getting Started” page in Part II of the series for installation instructions and first-things-to-do tips.

We go over the check types in much more detail in Part III of this series. In subsequent posts we’ll talk about the very important topic of how to actually tune your monitoring so you don’t drown in noise and false-positive alerts!