Nagios

These instructions were adapted by Ric Charlton from those found at http://nagios.sourceforge.net/docs/2_0/. They look lengthy, but most of the length is configuration files.

See also

Zabbix

Pre-requisites

  • Install a standard base Linux system; any distro should be OK, but these instructions were written using Debian Etch
  • Install the Apache web server (v2)
  • Install the gcc and g++ compilers and make, so that the application can be built
  • Download the tarballs for the latest version of Nagios (currently v2.9) and the Nagios Plugins (currently v1.4.8)
  • Root access

Installation of Nagios

  • Download the latest distribution package to your home directory on the Nagios server
  • Unpack the distribution package with the following command:

# tar xzvf nagios-version.tar.gz

  • Create the 'nagios' user, which will be used to run the Nagios program

# adduser nagios

  • You will be prompted for a password and then some user details... you may leave the user details blank but it's advisable to enter a password ;-)
  • Create the installation directory and give ownership to the new nagios user

# mkdir /usr/local/nagios

# chown nagios /usr/local/nagios

  • Create a group for issuing commands to Nagios from the web interface, then add both the Nagios user and the Apache user to it (in my case the Apache user is www-data). The -a flag appends the group instead of replacing the users' existing supplementary groups:

# groupadd nagioscmd

# usermod -a -G nagioscmd www-data

# usermod -a -G nagioscmd nagios
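
  • Optionally, confirm the memberships. Each command below lists the named user's groups, and both lists should include nagioscmd:

# id -nG www-data

# id -nG nagios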

  • Change into the unpacked directory and run the configure script as follows:

# cd nagios-version

# ./configure --prefix=/usr/local/nagios --with-cgiurl=/nagios/cgi-bin --with-htmurl=/nagios --with-nagios-user=nagios --with-nagios-group=nagios --with-command-group=nagioscmd

  • Check that the summary printed by configure shows the options you expect, then run make to compile the application

# make all

  • Run the following make targets to install Nagios and its init script, and to set the permissions on the external command directory:

# make install

# make install-init

# make install-commandmode
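
  • Optionally, inspect the external command directory that make install-commandmode should have created. Assuming the configure options above, it should be owned by nagios, belong to the nagioscmd group and have the setgid bit set, so the listing should begin with something like drwxrwsr-x:

# ls -ld /usr/local/nagios/var/rw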

  • Do not run make install-config, which installs the example configs. They are not needed because the config files are written by hand below

Installing the Plugins

  • Unpack the plugins package

# tar xzvf nagios-plugins-version.tar.gz

  • Change into the unpacked plugins directory and run the configure script:

# cd nagios-plugins-version

# ./configure --prefix=/usr/local/nagios --with-cgiurl=/nagios/cgi-bin

  • Run the make scripts to compile and install the plugins

# make

# make install
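
  • Optionally, check that the plugins work by running one by hand. This assumes they were installed under /usr/local/nagios/libexec, which is where the /usr/local/nagios prefix should put them. The thresholds are only example values (round-trip time in milliseconds, packet loss in percent):

# /usr/local/nagios/libexec/check_ping -H 127.0.0.1 -w 100.0,20% -c 500.0,60%

# echo $?

  • The first command should print a single status line starting with PING OK, and the second should print 0. By plugin convention, the exit codes are 0 = OK, 1 = WARNING, 2 = CRITICAL and 3 = UNKNOWN.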

Setup Apache Web Server

This section may be different if you are not using a Debian-based distro.

  • Create a file called nagios in /etc/apache2/sites-available containing the following:

ScriptAlias /nagios/cgi-bin /usr/local/nagios/sbin

<Directory "/usr/local/nagios/sbin">

   Options ExecCGI
   AllowOverride None
   Order allow,deny
   Allow from all
   AuthName "Nagios Access"
   AuthType Basic
   AuthUserFile /usr/local/nagios/etc/htpasswd.users
   Require valid-user

</Directory>

Alias /nagios /usr/local/nagios/share

<Directory "/usr/local/nagios/share">

   Options None
   AllowOverride None
   Order allow,deny
   Allow from all
   AuthName "Nagios Access"
   AuthType Basic
   AuthUserFile /usr/local/nagios/etc/htpasswd.users
   Require valid-user

</Directory>

  • Create a link to the above file in the directory /etc/apache2/sites-enabled

# cp -s /etc/apache2/sites-available/nagios /etc/apache2/sites-enabled
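
  • Optionally, check the Apache configuration for syntax errors before restarting. The output should end with Syntax OK. On Debian, a2ensite nagios is an alternative way to create the link above.

# apache2ctl configtest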

  • Restart Apache

# /etc/init.d/apache2 restart

  • Create the directory /usr/local/nagios/etc

# mkdir /usr/local/nagios/etc

  • Create a web-user who can access Nagios

# htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin

  • Enter a password when prompted
  • Additional users may be added with the following command. Leave out -c, which would create a new file and overwrite the existing users:

# htpasswd /usr/local/nagios/etc/htpasswd.users USERNAME
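
  • Optionally, confirm that the password protection is active. A request for the Nagios pages made without credentials should be refused, and the server response printed by wget should show a 401 status. This assumes wget is installed and that you run the command on the Nagios server itself:

# wget -S -O /dev/null http://localhost/nagios/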

Configuring Nagios

Base Configuration

  • You first need to add the main config files to the /usr/local/nagios/etc folder, starting with nagios.cfg:
######################################################
# NAGIOS.CFG - Main Config File for Nagios 
######################################################

# LOG FILE
# This is the main log file where service and host events are logged
# for historical purposes.  This should be the first option specified 
# in the config file!!!
log_file=/usr/local/nagios/var/nagios.log


# ## OBJECT CONFIGURATION FILE(S) ##
# Plugin commands (service and host check commands)
# Arguments are likely to change between different releases of the
# plugins, so you should use the same config file provided with the
# plugin release rather than the one provided with Nagios.
cfg_file=/usr/local/nagios/etc/checkcommands.cfg

# Misc commands (notification and event handler commands, etc)
cfg_file=/usr/local/nagios/etc/misccommands.cfg

# You can tell Nagios to process all config files (with a .cfg
# extension) in a particular directory by using the cfg_dir
# directive as shown below (this helps to tidy the config files):
cfg_dir=/usr/local/nagios/etc/conf.d


# ## OBJECT CACHE FILE ##
# This option determines where object definitions are cached when
# Nagios starts/restarts.  The CGIs read object definitions from 
# this cache file (rather than looking at the object config files
# directly) in order to prevent inconsistencies that can occur
# when the config files are modified after Nagios starts.
object_cache_file=/usr/local/nagios/var/objects.cache


# ## RESOURCE FILE ##
# This is an optional resource file that contains $USERx$ macro
# definitions. Multiple resource files can be specified by using
# multiple resource_file definitions.  The CGIs will not attempt to
# read the contents of resource files, so information that is
# considered to be sensitive (usernames, passwords, etc) can be
# defined as macros in this file and restrictive permissions (600)
# can be placed on this file.
resource_file=/usr/local/nagios/etc/resource.cfg


# ## STATUS FILE ##
# This is where the current status of all monitored services and
# hosts is stored.  Its contents are read and processed by the CGIs.
# The contents of the status file are deleted every time Nagios
#  restarts.
status_file=/usr/local/nagios/var/status.dat


# ## NAGIOS USER ##
# This determines the effective user that Nagios should run as.  
# You can either supply a username or a UID.
nagios_user=nagios


# ## NAGIOS GROUP ##
# This determines the effective group that Nagios should run as.  
# You can either supply a group name or a GID.
nagios_group=nagios


# ## EXTERNAL COMMAND OPTION ##
# This option allows you to specify whether or not Nagios should check
# for external commands (in the command file defined below).  By default
# Nagios will *not* check for external commands, just to be on the
# cautious side.  If you want to be able to use the CGI command interface
# you will have to enable this.  Setting this value to 0 disables command
# checking (the default), other values enable it.
check_external_commands=1


# ## EXTERNAL COMMAND CHECK INTERVAL ##
# This is the interval at which Nagios should check for external commands.
# This value works off the interval_length you specify later.  If you leave
# that at its default value of 60 (seconds), a value of 1 here will cause
# Nagios to check for external commands every minute.  If you specify a
# number followed by an "s" (i.e. 15s), this will be interpreted to mean
# actual seconds rather than a multiple of the interval_length variable.
# Note: In addition to reading the external command file at regularly 
# scheduled intervals, Nagios will also check for external commands after
# event handlers are executed.
# NOTE: Setting this value to -1 causes Nagios to check the external
# command file as often as possible.
command_check_interval=-1


# ## EXTERNAL COMMAND FILE ##
# This is the file that Nagios checks for external command requests.
# It is also where the command CGI will write commands that are submitted
# by users, so it must be writeable by the user that the web server
# is running as (www-data on Debian).  Permissions should be set at the 
# directory level instead of on the file, as the file is deleted every
# time its contents are processed.
command_file=/usr/local/nagios/var/rw/nagios.cmd


# ## COMMENT FILE ##
# This is the file that Nagios will use for storing host and service
# comments.
comment_file=/usr/local/nagios/var/comments.dat


# ## DOWNTIME FILE ##
# This is the file that Nagios will use for storing host and service
# downtime data.
downtime_file=/usr/local/nagios/var/downtime.dat


# ## LOCK FILE ##
# This is the lockfile that Nagios will use to store its PID number
# in when it is running in daemon mode.
lock_file=/usr/local/nagios/var/nagios.lock


# ## TEMP FILE ##
# This is a temporary file that is used as scratch space when Nagios
# updates the status log, cleans the comment file, etc.  This file
# is created, used, and deleted throughout the time that Nagios is
# running.
temp_file=/usr/local/nagios/var/nagios.tmp


# ## EVENT BROKER OPTIONS ##
# Controls what (if any) data gets sent to the event broker.
# Values:  0      = Broker nothing
#         -1      = Broker everything
#         <other> = See documentation
event_broker_options=-1


# ## EVENT BROKER MODULE(S) ##
# This directive is used to specify an event broker module that should
# be loaded by Nagios at startup.  Use multiple directives if you want
# to load more than one module.  Arguments that should be passed to
# the module at startup are separated from the module path by a space.
#
# Example:
#
#   broker_module=<modulepath> [moduleargs]


# ## LOG ROTATION METHOD ##
# This is the log rotation method that Nagios should use to rotate
# the main log file. Values are as follows..
#	n	= None - don't rotate the log
#	h	= Hourly rotation (top of the hour)
#	d	= Daily rotation (midnight every day)
#	w	= Weekly rotation (midnight on Saturday evening)
#	m	= Monthly rotation (midnight last day of month)
log_rotation_method=d


# ## LOG ARCHIVE PATH ##
# This is the directory where archived (rotated) log files should be 
# placed (assuming you've chosen to do log rotation).
log_archive_path=/usr/local/nagios/var/archives


# ## LOGGING OPTIONS ##
# If you want messages logged to the syslog facility, as well as the
# main Nagios log file, set this option to 1.  If not, set it to 0.
use_syslog=1


# ## NOTIFICATION LOGGING OPTION ##
# If you don't want notifications to be logged, set this value to 0.
# If notifications should be logged, set the value to 1.
log_notifications=1


# ## SERVICE RETRY LOGGING OPTION ##
# If you don't want service check retries to be logged, set this value
# to 0.  If retries should be logged, set the value to 1.
log_service_retries=1


# ## HOST RETRY LOGGING OPTION ##
# If you don't want host check retries to be logged, set this value to
# 0.  If retries should be logged, set the value to 1.
log_host_retries=1


# ## EVENT HANDLER LOGGING OPTION ##
# If you don't want host and service event handlers to be logged, set
# this value to 0.  If event handlers should be logged, set the value
# to 1.
log_event_handlers=1


# ## INITIAL STATES LOGGING OPTION ##
# If you want Nagios to log all initial host and service states to
# the main log file (the first time the service or host is checked)
# you can enable this option by setting this value to 1.  If you
# are not using an external application that does long term state
# statistics reporting, you do not need to enable this option.
log_initial_states=0


# ## EXTERNAL COMMANDS LOGGING OPTION ##
# If you don't want Nagios to log external commands, set this value
# to 0.  If external commands should be logged, set this value to 1.
# Note: This option does not include logging of passive service
# checks - see the option below for controlling whether or not
# passive checks are logged.
log_external_commands=1


# ## PASSIVE CHECKS LOGGING OPTION ##
# If you don't want Nagios to log passive host and service checks, set
# this value to 0.  If passive checks should be logged, set
# this value to 1.
log_passive_checks=1


# ## GLOBAL HOST AND SERVICE EVENT HANDLERS ##
# These options allow you to specify a host and service event handler
# command that is to be run for every host or service state change.
# The global event handler is executed immediately prior to the event
# handler that you have optionally specified in each host or
# service definition. The command argument is the short name of a
# command definition that you define in your host configuration file.
# Read the HTML docs for more information.
#global_host_event_handler=somecommand
#global_service_event_handler=somecommand


# ## SERVICE INTER-CHECK DELAY METHOD ##
# This is the method that Nagios should use when initially
# "spreading out" service checks when it starts monitoring.  The
# default is to use smart delay calculation, which will try to
# space all service checks out evenly to minimize CPU load.
# Using the dumb setting will cause all checks to be scheduled
# at the same time (with no delay between them)!  This is not a
# good thing for production, but is useful when testing the
# parallelization functionality.
#	n	= None - don't use any delay between checks
#	d	= Use a "dumb" delay of 1 second between checks
#	s	= Use "smart" inter-check delay calculation
#       x.xx    = Use an inter-check delay of x.xx seconds
service_inter_check_delay_method=s


# ## MAXIMUM SERVICE CHECK SPREAD ##
# This variable determines the timeframe (in minutes) from the
# program start time that an initial check of all services should
# be completed.  Default is 30 minutes.
max_service_check_spread=30


# ## SERVICE CHECK INTERLEAVE FACTOR ##
# This variable determines how service checks are interleaved.
# Interleaving the service checks allows for a more even
# distribution of service checks and reduced load on remote
# hosts.  Setting this value to 1 is equivalent to how versions
# of Nagios previous to 0.0.5 did service checks.  Set this
# value to s (smart) for automatic calculation of the interleave
# factor unless you have a specific reason to change it.
#       s       = Use "smart" interleave factor calculation
#       x       = Use an interleave factor of x, where x is a
#                 number greater than or equal to 1.
service_interleave_factor=s


# ## HOST INTER-CHECK DELAY METHOD ##
# This is the method that Nagios should use when initially
# "spreading out" host checks when it starts monitoring.  The
# default is to use smart delay calculation, which will try to
# space all host checks out evenly to minimize CPU load.
# Using the dumb setting will cause all checks to be scheduled
# at the same time (with no delay between them)!
#	n	= None - don't use any delay between checks
#	d	= Use a "dumb" delay of 1 second between checks
#	s	= Use "smart" inter-check delay calculation
#       x.xx    = Use an inter-check delay of x.xx seconds
host_inter_check_delay_method=s


# ## MAXIMUM HOST CHECK SPREAD ##
# This variable determines the timeframe (in minutes) from the
# program start time that an initial check of all hosts should
# be completed.  Default is 30 minutes.
max_host_check_spread=30


# ## MAXIMUM CONCURRENT SERVICE CHECKS ##
# This option allows you to specify the maximum number of 
# service checks that can be run in parallel at any given time.
# Specifying a value of 1 for this variable essentially prevents
# any service checks from being parallelized.  A value of 0
# will not restrict the number of concurrent checks that are
# being executed.
max_concurrent_checks=0


# ## SERVICE CHECK REAPER FREQUENCY ##
# This is the frequency (in seconds!) that Nagios will process
# the results of services that have been checked.
service_reaper_frequency=10


# ## AUTO-RESCHEDULING OPTION ##
# This option determines whether or not Nagios will attempt to
# automatically reschedule active host and service checks to
# "smooth" them out over time.  This can help balance the load on
# the monitoring server.  
# WARNING: THIS IS AN EXPERIMENTAL FEATURE - IT CAN DEGRADE
# PERFORMANCE, RATHER THAN INCREASE IT, IF USED IMPROPERLY
auto_reschedule_checks=0


# ## AUTO-RESCHEDULING INTERVAL ##
# This option determines how often (in seconds) Nagios will
# attempt to automatically reschedule checks.  This option only
# has an effect if the auto_reschedule_checks option is enabled.
# Default is 30 seconds.
# WARNING: THIS IS AN EXPERIMENTAL FEATURE - IT CAN DEGRADE
# PERFORMANCE, RATHER THAN INCREASE IT, IF USED IMPROPERLY
auto_rescheduling_interval=30


# ## AUTO-RESCHEDULING WINDOW ##
# This option determines the "window" of time (in seconds) that
# Nagios will look at when automatically rescheduling checks.
# Only host and service checks that occur in the next X seconds
# (determined by this variable) will be rescheduled. This option
# only has an effect if the auto_reschedule_checks option is
# enabled.  Default is 180 seconds (3 minutes).
# WARNING: THIS IS AN EXPERIMENTAL FEATURE - IT CAN DEGRADE
# PERFORMANCE, RATHER THAN INCREASE IT, IF USED IMPROPERLY
auto_rescheduling_window=180


# ## SLEEP TIME ##
# This is the number of seconds to sleep between checking for system
# events and service checks that need to be run.
sleep_time=0.25


# ## TIMEOUT VALUES ##
# These options control how much time Nagios will allow various
# types of commands to execute before killing them off.  Options
# are available for controlling maximum time allotted for
# service checks, host checks, event handlers, notifications, the
# ocsp command, and performance data commands.  All values are in
# seconds.
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5


# ## RETAIN STATE INFORMATION ##
# This setting determines whether or not Nagios will save state
# information for services and hosts before it shuts down.  Upon
# startup Nagios will reload all saved service and host state
# information before starting to monitor.  This is useful for 
# maintaining long-term data on state statistics, etc, but will
# slow Nagios down a bit when it (re)starts.  Since it's only
# a one-time penalty, I think it's well worth the additional
# startup delay.
retain_state_information=1


# ## STATE RETENTION FILE ##
# This is the file that Nagios should use to store host and
# service state information before it shuts down.  The state 
# information in this file is also read immediately prior to
# starting to monitor the network when Nagios is restarted.
# This file is used only if the retain_state_information
# variable is set to 1.
state_retention_file=/usr/local/nagios/var/retention.dat


# ## RETENTION DATA UPDATE INTERVAL ##
# This setting determines how often (in minutes) that Nagios
# will automatically save retention data during normal operation.
# If you set this value to 0, Nagios will not save retention
# data at regular interval, but it will still save retention
# data before shutting down or restarting.  If you have disabled
# state retention, this option has no effect.
retention_update_interval=60


# ## USE RETAINED PROGRAM STATE ##
# This setting determines whether or not Nagios will set 
# program status variables based on the values saved in the
# retention file.  If you want to use retained program status
# information, set this value to 1.  If not, set this value
# to 0.
use_retained_program_state=1


# ## USE RETAINED SCHEDULING INFO ##
# This setting determines whether or not Nagios will retain
# the scheduling info (next check time) for hosts and services
# based on the values saved in the retention file.
# If you want to use retained scheduling info, set this
# value to 1.  If not, set this value to 0.
use_retained_scheduling_info=0


# ## INTERVAL LENGTH ##
# This is the seconds per unit interval as used in the
# host/contact/service configuration files.  Setting this to 60 means
# that each interval is one minute long (60 seconds).  Other settings
# have not been tested much, so your mileage is likely to vary...
interval_length=60
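# Worked example (illustrative only, not part of the stock file):
# with interval_length=60, a service defined with a hypothetical
# normal_check_interval=5 is scheduled every 5 x 60 = 300 seconds.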


# ## AGGRESSIVE HOST CHECKING OPTION ##
# If you don't want to turn on aggressive host checking features, set
# this value to 0 (the default).  Otherwise set this value to 1 to
# enable the aggressive check option.  Read the docs for more info
# on what aggressive host check is or check out the source code in
# base/checks.c
use_aggressive_host_checking=0


# ## SERVICE CHECK EXECUTION OPTION ##
# This determines whether or not Nagios will actively execute
# service checks when it initially starts.  If this option is 
# disabled, checks are not actively made, but Nagios can still
# receive and process passive check results that come in.  Unless
# you're implementing redundant hosts or have a special need for
# disabling the execution of service checks, leave this enabled!
# Values: 1 = enable checks, 0 = disable checks
execute_service_checks=1


# ## PASSIVE SERVICE CHECK ACCEPTANCE OPTION ##
# This determines whether or not Nagios will accept passive
# service check results when it initially (re)starts.
# Values: 1 = accept passive checks, 0 = reject passive checks
accept_passive_service_checks=1


# ## HOST CHECK EXECUTION OPTION ##
# This determines whether or not Nagios will actively execute
# host checks when it initially starts.  If this option is 
# disabled, checks are not actively made, but Nagios can still
# receive and process passive check results that come in.  Unless
# you're implementing redundant hosts or have a special need for
# disabling the execution of host checks, leave this enabled!
# Values: 1 = enable checks, 0 = disable checks
execute_host_checks=1


# ## PASSIVE HOST CHECK ACCEPTANCE OPTION ##
# This determines whether or not Nagios will accept passive
# host check results when it initially (re)starts.
# Values: 1 = accept passive checks, 0 = reject passive checks
accept_passive_host_checks=1


# ## NOTIFICATIONS OPTION ##
# This determines whether or not Nagios will send out any host or
# service notifications when it is initially (re)started.
# Values: 1 = enable notifications, 0 = disable notifications
enable_notifications=1


# ## EVENT HANDLER USE OPTION ##
# This determines whether or not Nagios will run any host or
# service event handlers when it is initially (re)started.  Unless
# you're implementing redundant hosts, leave this option enabled.
# Values: 1 = enable event handlers, 0 = disable event handlers
enable_event_handlers=1


# ## PROCESS PERFORMANCE DATA OPTION ##
# This determines whether or not Nagios will process performance
# data returned from service and host checks.  If this option is
# enabled, host performance data will be processed using the
# host_perfdata_command (defined below) and service performance
# data will be processed using the service_perfdata_command (also
# defined below).  Read the HTML docs for more information on
# performance data.
# Values: 1 = process performance data, 0 = do not process performance data
process_performance_data=1


# ## HOST AND SERVICE PERFORMANCE DATA PROCESSING COMMANDS ##
# These commands are run after every host and service check is
# performed.  These commands are executed only if the
# process_performance_data option (above) is set to 1.  The command
# argument is the short name of a command definition that you 
# define in your host configuration file.  Read the HTML docs for
# more information on performance data.
host_perfdata_command=process-host-perfdata
service_perfdata_command=process-service-perfdata


# ## HOST AND SERVICE PERFORMANCE DATA FILES ##
# These files are used to store host and service performance data.
# Performance data is only written to these files if the
# process_performance_data option (above) is set to 1.
host_perfdata_file=/tmp/host-perfdata
service_perfdata_file=/tmp/service-perfdata


# ## HOST AND SERVICE PERFORMANCE DATA FILE TEMPLATES ##
# These options determine what data is written (and how) to the
# performance data files.  The templates may contain macros, special
# characters (\t for tab, \r for carriage return, \n for newline)
# and plain text.  A newline is automatically added after each write
# to the performance data file.  Some examples of what you can do are
# shown below.
host_perfdata_file_template=[HOSTPERFDATA]\t$TIMET$\t$HOSTNAME$\t$HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$
service_perfdata_file_template=[SERVICEPERFDATA]\t$TIMET$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICEEXECUTIONTIME$\t$SERVICELATENCY$\t$SERVICEOUTPUT$\t$SERVICEPERFDATA$


# ## HOST AND SERVICE PERFORMANCE DATA FILE MODES ##
# This option determines whether or not the host and service
# performance data files are opened in write ("w") or append ("a")
# mode.  Unless the files are named pipes, you will probably
# want to use the default mode of append ("a").
host_perfdata_file_mode=a
service_perfdata_file_mode=a


# ## HOST AND SERVICE PERFORMANCE DATA FILE PROCESSING INTERVAL ##
# These options determine how often (in seconds) the host and service
# performance data files are processed using the commands defined
# below.  A value of 0 indicates the files should not be periodically
# processed.
host_perfdata_file_processing_interval=120
service_perfdata_file_processing_interval=120


# ## OBSESS OVER SERVICE CHECKS OPTION ##
# This determines whether or not Nagios will obsess over service
# checks and run the ocsp_command defined below.  Unless you're
# planning on implementing distributed monitoring, do not enable
# this option.  Read the HTML docs for more information on
# implementing distributed monitoring.
# Values: 1 = obsess over services, 0 = do not obsess (default)
obsess_over_services=0


# ## OBSESSIVE COMPULSIVE SERVICE PROCESSOR COMMAND ##
# This is the command that is run for every service check that is
# processed by Nagios.  This command is executed only if the
# obsess_over_services option (above) is set to 1.  The command 
# argument is the short name of a command definition that you
# define in your host configuration file. Read the HTML docs for
# more information on implementing distributed monitoring.
#ocsp_command=somecommand


# ## ORPHANED SERVICE CHECK OPTION ##
# This determines whether or not Nagios will periodically 
# check for orphaned services.  Since service checks are not
# rescheduled until the results of their previous execution 
# instance are processed, there exists a possibility that some
# checks may never get rescheduled.  This seems to be a rare
# problem and should not happen under normal circumstances.
# If you have problems with service checks never getting
# rescheduled, you might want to try enabling this option.
# Values: 1 = enable checks, 0 = disable checks
check_for_orphaned_services=0


# ## SERVICE FRESHNESS CHECK OPTION ##
# This option determines whether or not Nagios will periodically
# check the "freshness" of service results.  Enabling this option
# is useful for ensuring passive checks are received in a timely
# manner.
# Values: 1 = enable freshness checking, 0 = disable freshness checking
check_service_freshness=1


# ## SERVICE FRESHNESS CHECK INTERVAL ##
# This setting determines how often (in seconds) Nagios will
# check the "freshness" of service check results.  If you have
# disabled service freshness checking, this option has no effect.
service_freshness_check_interval=60


# ## HOST FRESHNESS CHECK OPTION ##
# This option determines whether or not Nagios will periodically
# check the "freshness" of host results.  Enabling this option
# is useful for ensuring passive checks are received in a timely
# manner.
# Values: 1 = enable freshness checking, 0 = disable freshness checking
check_host_freshness=0


# ## HOST FRESHNESS CHECK INTERVAL ##
# This setting determines how often (in seconds) Nagios will
# check the "freshness" of host check results.  If you have
# disabled host freshness checking, this option has no effect.
host_freshness_check_interval=60


# ## AGGREGATED STATUS UPDATES ##
# This option determines whether or not Nagios will 
# aggregate updates of host, service, and program status
# data.  Normally, status data is updated immediately when
# a change occurs.  This can result in high CPU loads if
# you are monitoring a lot of services.  If you want Nagios
# to only refresh status data every few seconds, enable
# this option.
# Values: 1 = enable aggregate updates, 0 = disable aggregate updates
aggregate_status_updates=1


# ## AGGREGATED STATUS UPDATE INTERVAL ##
# Combined with the aggregate_status_updates option,
# this option determines the frequency (in seconds!) that
# Nagios will periodically dump program, host, and 
# service status data.  If you are not using aggregated
# status data updates, this option has no effect.
status_update_interval=15


# ## FLAP DETECTION OPTION ##
# This option determines whether or not Nagios will try
# and detect hosts and services that are "flapping".  
# Flapping occurs when a host or service changes between
# states too frequently.  When Nagios detects that a 
# host or service is flapping, it will temporarily suppress
# notifications for that host/service until it stops
# flapping.  Flap detection is very experimental, so read
# the HTML documentation before enabling this feature!
# Values: 1 = enable flap detection
#         0 = disable flap detection (default)
enable_flap_detection=0


# ## FLAP DETECTION THRESHOLDS FOR HOSTS AND SERVICES ##
# Read the HTML documentation on flap detection for
# an explanation of what this option does.  This option
# has no effect if flap detection is disabled.
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0


# ## DATE FORMAT OPTION ##
# This option determines how short dates are displayed. Valid options
# include:
#	us		(MM-DD-YYYY HH:MM:SS)
#	euro    	(DD-MM-YYYY HH:MM:SS)
#	iso8601		(YYYY-MM-DD HH:MM:SS)
#	strict-iso8601	(YYYY-MM-DDTHH:MM:SS)
date_format=euro


# ## P1.PL FILE LOCATION ##
# This value determines where the p1.pl perl script (used by the
# embedded Perl interpreter) is located.  If you didn't compile
# Nagios with embedded Perl support, this option has no effect.
p1_file=/usr/local/nagios/bin/p1.pl


# ## ILLEGAL OBJECT NAME CHARACTERS ##
# This option allows you to specify illegal characters that cannot
# be used in host names, service descriptions, or names of other
# object types.
illegal_object_name_chars=`~!$%^&*|'"<>?,()=


# ## ILLEGAL MACRO OUTPUT CHARACTERS ##
# This option allows you to specify illegal characters that are
# stripped from macros before being used in notifications, event
# handlers, etc.  This DOES NOT affect macros used in service or
# host check commands.
# The following macros are stripped of the characters you specify:
#	$HOSTOUTPUT$
#	$HOSTPERFDATA$
#	$HOSTACKAUTHOR$
#	$HOSTACKCOMMENT$
#	$SERVICEOUTPUT$
#	$SERVICEPERFDATA$
#	$SERVICEACKAUTHOR$
#	$SERVICEACKCOMMENT$
illegal_macro_output_chars=`~$&|'"<>


# ## REGULAR EXPRESSION MATCHING ##
# This option controls whether or not regular expression matching
# takes place in the object config files.  Regular expression
# matching is used to match host, hostgroup, service, and service
# group names/descriptions in some fields of various object types.
# Values: 1 = enable regexp matching, 0 = disable regexp matching
use_regexp_matching=0


# ## "TRUE" REGULAR EXPRESSION MATCHING ##
# This option controls whether or not "true" regular expression 
# matching takes place in the object config files.  This option
# only has an effect if regular expression matching is enabled
# (see above).  If this option is DISABLED, regular expression
# matching only occurs if a string contains wildcard characters
# (* and ?).  If the option is ENABLED, regexp matching occurs
# all the time (which can be annoying).
# Values: 1 = enable true matching, 0 = disable true matching
use_true_regexp_matching=0


# ## ADMINISTRATOR EMAIL ADDRESS ##
# The email address of the administrator of *this* machine (the one
# doing the monitoring).  Nagios never uses this value itself, but
# you can access this value by using the $ADMINEMAIL$ macro in your
# notification commands.
admin_email=nagios


# ## ADMINISTRATOR PAGER NUMBER/ADDRESS ##
# The pager number/address for the administrator of *this* machine.
# Nagios never uses this value itself, but you can access this
# value by using the $ADMINPAGER$ macro in your notification
# commands.
admin_pager=pagenagios


# ## DAEMON CORE DUMP OPTION ##
# This option determines whether or not Nagios is allowed to create
# a core dump when it runs as a daemon.  Note that it is generally
# considered bad form to allow this, but it may be useful for
# debugging purposes.
# Values: 1 - Allow core dumps
#         0 - Do not allow core dumps (default)
daemon_dumps_core=0

# ## EOF (End of file) ##
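The `illegal_macro_output_chars` setting above strips characters such as `|`, `$` and backticks from macros like `$SERVICEOUTPUT$` before they are handed to notification commands. This helps keep shell metacharacters out of pipelines such as the `printf ... | mail` notification commands. The Python sketch below only illustrates that stripping step. It is not Nagios's own code, and the helper name `strip_macro_output` is hypothetical.

```python
# The character list from the illegal_macro_output_chars line above.
ILLEGAL_MACRO_OUTPUT_CHARS = "`~$&|'\"<>"


def strip_macro_output(value: str, illegal: str = ILLEGAL_MACRO_OUTPUT_CHARS) -> str:
    """Remove every illegal character from a macro value (hypothetical helper)."""
    # A three-argument str.maketrans maps each listed character to None (deletion).
    return value.translate(str.maketrans("", "", illegal))


if __name__ == "__main__":
    # Hypothetical plugin output containing shell metacharacters.
    print(strip_macro_output("HTTP WARNING: <title>Intranet & Mail</title> `uptime`"))
```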
######################################################
# CGI.CFG - CGI Configuration File for Nagios 
######################################################

# ## MAIN CONFIGURATION FILE ##
# This tells the CGIs where to find your main configuration file.
# The CGIs will read the main and host config files for any other
# data they might need.

main_config_file=/usr/local/nagios/etc/nagios.cfg

# ## PHYSICAL HTML PATH ##
# This is the path where the HTML files for Nagios reside.  This
# value is used to locate the logo images needed by the statusmap
# and statuswrl CGIs.
physical_html_path=/usr/local/nagios/share


# ## URL HTML PATH ##
# This is the path portion of the URL that corresponds to the
# physical location of the Nagios HTML files (as defined above).
# This value is used by the CGIs to locate the online documentation
# and graphics.  If you access the Nagios pages with a URL like
# http://www.myhost.com/nagios, this value should be '/nagios'
# (without the quotes).
url_html_path=/nagios


# ## CONTEXT-SENSITIVE HELP ##
# This option determines whether or not a context-sensitive
# help icon will be displayed for most of the CGIs.
# Values: 0 = disables context-sensitive help
#         1 = enables context-sensitive help
show_context_help=0


# ## NAGIOS PROCESS CHECK COMMAND ##
# This is the full path and filename of the program used to check
# the status of the Nagios process.  It is used only by the CGIs
# and is completely optional.  However, if you don't use it, you'll
# see warning messages in the CGIs about the Nagios process
# not running and you won't be able to execute any commands from
# the web interface.  The program should follow the same rules
# as plugins; the return codes are the same as for the plugins,
# it should have timeout protection, it should output something
# to STDOUT, etc.
# Note: The command line for the check_nagios plugin below may
# have to be tweaked a bit, as different versions of the plugin
# use different command line arguments/syntaxes.
nagios_check_command=/usr/local/nagios/libexec/check_nagios /usr/local/nagios/var/status.dat 5 '/usr/local/nagios/bin/nagios'


# ## AUTHENTICATION USAGE ##
# This option controls whether or not the CGIs will use any 
# authentication when displaying host and service information, as
# well as committing commands to Nagios for processing.  
#
# Read the HTML documentation to learn how the authorization works!
#
# NOTE: It is a really *bad* idea to disable authorization, unless
# you plan on removing the command CGI (cmd.cgi)!  Failure to do
# so will leave you wide open to kiddies messing with Nagios and
# possibly hitting you with a denial of service attack by filling up
# your drive by continuously writing to your command file!
#
# Setting this value to 0 will cause the CGIs to *not* use
# authentication (bad idea), while any other value will make them
# use the authentication functions (the default).
use_authentication=1


# ## DEFAULT USER ##
# Setting this variable will define a default user name that can
# access pages without authentication.  This allows people within a
# secure domain (i.e., behind a firewall) to see the current status
# without authenticating.  You may want to use this to avoid basic
# authentication if you are not using a secure server since basic
# authentication transmits passwords in the clear.
#
# Important:  Do not define a default username unless you are
# running a secure web server and are sure that everyone who has
# access to the CGIs has been authenticated in some manner!  If you
# define this variable, anyone who has not authenticated to the web
# server will inherit all rights you assign to this user!
#default_user_name=guest


# ## SYSTEM/PROCESS INFORMATION ACCESS ##
# This option is a comma-delimited list of all usernames that
# have access to viewing the Nagios process information as
# provided by the Extended Information CGI (extinfo.cgi).  By
# default, *no one* has access to this unless you choose to
# not use authorization.  You may use an asterisk (*) to
# authorize any user who has authenticated to the web server.
authorized_for_system_information=nagiosadmin


# ## CONFIGURATION INFORMATION ACCESS ##
# This option is a comma-delimited list of all usernames that
# can view ALL configuration information (hosts, commands, etc).
# By default, users can only view configuration information
# for the hosts and services they are contacts for. You may use
# an asterisk (*) to authorize any user who has authenticated
# to the web server.
authorized_for_configuration_information=nagiosadmin


# ## SYSTEM/PROCESS COMMAND ACCESS ##
# This option is a comma-delimited list of all usernames that
# can issue shutdown and restart commands to Nagios via the
# command CGI (cmd.cgi).  Users in this list can also change
# the program mode to active or standby. By default, *no one*
# has access to this unless you choose to not use authorization.
# You may use an asterisk (*) to authorize any user who has
# authenticated to the web server.
authorized_for_system_commands=nagiosadmin


# ## GLOBAL HOST/SERVICE VIEW ACCESS ##
# These two options are comma-delimited lists of all usernames that
# can view information for all hosts and services that are being
# monitored.  By default, users can only view information
# for hosts or services that they are contacts for (unless you
# choose not to use authorization). You may use an asterisk (*)
# to authorize any user who has authenticated to the web server.
authorized_for_all_services=nagiosadmin,guest
authorized_for_all_hosts=nagiosadmin,guest


# ## GLOBAL HOST/SERVICE COMMAND ACCESS ##
# These two options are comma-delimited lists of all usernames that
# can issue host or service related commands via the command
# CGI (cmd.cgi) for all hosts and services that are being monitored. 
# By default, users can only issue commands for hosts or services 
# that they are contacts for (unless you choose not to use
# authorization).  You may use an asterisk (*) to authorize any
# user who has authenticated to the web server.
authorized_for_all_service_commands=nagiosadmin
authorized_for_all_host_commands=nagiosadmin


# ## STATUSMAP BACKGROUND IMAGE ##
# This option allows you to specify an image to be used as a 
# background in the statusmap CGI.  It is assumed that the image
# resides in the HTML images path (i.e. /usr/local/nagios/share/images).
# This path is automatically determined by appending "/images"
# to the path specified by the 'physical_html_path' directive.
# Note:  The image file may be in GIF, PNG, JPEG, or GD2 format.
# However, I recommend that you convert your image to GD2 format
# (uncompressed), as this will cause less CPU load when the CGI
# generates the image.
#statusmap_background_image=smbackground.gd2


# ## DEFAULT STATUSMAP LAYOUT METHOD ##
# This option allows you to specify the default layout method
# the statusmap CGI should use for drawing hosts.  If you do
# not use this option, the default is to use user-defined
# coordinates.  Valid options are as follows:
#	0 = User-defined coordinates
#	1 = Depth layers
#	2 = Collapsed tree
#	3 = Balanced tree
#	4 = Circular
#	5 = Circular (Marked Up)
default_statusmap_layout=5


# ## DEFAULT STATUSWRL LAYOUT METHOD ##
# This option allows you to specify the default layout method
# the statuswrl (VRML) CGI should use for drawing hosts.  If you
# do not use this option, the default is to use user-defined
# coordinates.  Valid options are as follows:
#	0 = User-defined coordinates
#	2 = Collapsed tree
#	3 = Balanced tree
#	4 = Circular
default_statuswrl_layout=4


# ## STATUSWRL INCLUDE ##
# This option allows you to include your own objects in the 
# generated VRML world.  It is assumed that the file
# resides in the HTML path (i.e. /usr/local/nagios/share).
#statuswrl_include=myworld.wrl


# ## PING SYNTAX ##
# This option determines what syntax should be used when
# attempting to ping a host from the WAP interface (using
# the statuswml CGI).  You must include the full path to
# the ping binary, along with all required options.  The
# $HOSTADDRESS$ macro is substituted with the address of
# the host before the command is executed.
# Please note that the syntax for the ping binary is
# notorious for being different on virtually every *NIX
# OS and distribution, so you may have to tweak this to
# work on your system.
ping_syntax=/bin/ping -n -U -c 5 $HOSTADDRESS$


# ## REFRESH RATE ##
# This option allows you to specify the refresh rate in seconds
# of various CGIs (status, statusmap, extinfo, and outages).  
refresh_rate=90


# ## SOUND OPTIONS ##
# These options allow you to specify an optional audio file
# that should be played in your browser window when there are
# problems on the network.  The audio files are used only in
# the status CGI.  Only the sound for the most critical problem
# will be played.  Order of importance (higher to lower) is as
# follows: unreachable hosts, down hosts, critical services,
# warning services, and unknown services. If there are no
# visible problems, the sound file optionally specified by
# 'normal_sound' variable will be played.
#
# <varname>=<sound_file>
# Note: All audio files must be placed in the /media subdirectory
# under the HTML path (i.e. /usr/local/nagios/share/media/).
#host_unreachable_sound=hostdown.wav
#host_down_sound=hostdown.wav
#service_critical_sound=critical.wav
#service_warning_sound=warning.wav
#service_unknown_sound=warning.wav
#normal_sound=noproblem.wav

# ## End of File ##
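The `authorized_for_*` options in cgi.cfg are comma-delimited lists of user names, where `*` stands for any user who has authenticated to the web server. Below is a rough Python illustration of how such a list might be checked. The function `is_authorized` is hypothetical. The sketch also assumes that `default_user_name` applies only to unauthenticated visitors and never matches `*`. It shows why the comments above warn against setting a default user on an insecure server: anyone who skips authentication gets that user's rights.

```python
from typing import Optional


def is_authorized(option_value: str, remote_user: Optional[str],
                  default_user: Optional[str] = None) -> bool:
    """Illustrative check of a user against an authorized_for_* list."""
    allowed = {name.strip() for name in option_value.split(",") if name.strip()}
    if remote_user:
        # "*" authorizes anyone who has authenticated to the web server.
        return "*" in allowed or remote_user in allowed
    if default_user:
        # Unauthenticated visitors inherit default_user_name's rights.
        return default_user in allowed
    return False


if __name__ == "__main__":
    # authorized_for_all_hosts=nagiosadmin,guest from the file above.
    print(is_authorized("nagiosadmin,guest", "guest"))
```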
#########################################################################
# RESOURCE.CFG - Resource File for Nagios 
#
# You can define $USERx$ macros in this file, which can in turn be used
# in command definitions in your host config file(s).  $USERx$ macros are
# useful for storing sensitive information such as usernames, passwords, 
# etc.  They are also handy for specifying the path to plugins and 
# event handlers - if you decide to move the plugins or event handlers to
# a different directory in the future, you can just update one or two
# $USERx$ macros, instead of modifying a lot of command definitions.
#
# The CGIs will not attempt to read the contents of resource files, so
# you can set restrictive permissions (600 or 660) on them.
#
# Nagios supports up to 32 $USERx$ macros ($USER1$ through $USER32$)
#
# Resource files may also be used to store configuration directives for
# external data sources like MySQL...
#######################################################################

# Sets $USER1$ to be the path to the plugins
$USER1$=/usr/local/nagios/libexec
# Sets $USER2$ to be the path to event handlers
#$USER2$=/usr/local/nagios/libexec/eventhandlers
# Store some usernames and passwords (hidden from the CGIs)
#$USER3$=someuser
#$USER4$=somepassword


# ## DB STATUS DATA ##
# Note: These config directives are only used if you compiled
# in database support for status data!
# The user you specify here needs SELECT, INSERT, UPDATE, and
# DELETE privileges on the 'programstatus', 'hoststatus',
# and 'servicestatus' tables in the database.
#xsddb_host=somehost
#xsddb_port=someport
#xsddb_database=nagios
#xsddb_username=nagios
#xsddb_password=password
#xsddb_optimize_data=1
#xsddb_optimize_interval=3600


# ## DB COMMENT DATA ##
# Note: These config directives are only used if you compiled
# in database support for comment data!
# The user you specify here needs SELECT, INSERT, UPDATE, and
# DELETE privileges on the 'hostcomments' and 'servicecomments'
# tables in the database.
#xcddb_host=somehost
#xcddb_port=someport
#xcddb_database=nagios
#xcddb_username=nagios
#xcddb_password=password
#xcddb_optimize_data=1


# ## DB DOWNTIME DATA ##
# Note: These config directives are only used if you compiled
# in database support for downtime data!
# The user you specify here needs SELECT, INSERT, UPDATE, and
# DELETE privileges on the 'hostdowntime' and 'servicedowntime'
# tables in the database.
#xdddb_host=somehost
#xdddb_port=someport
#xdddb_database=nagios
#xdddb_username=nagios
#xdddb_password=password
#xdddb_optimize_data=1


# ## DB RETENTION DATA ##
# Note: These config directives are only used if you compiled
# in database support for retention data!
# The user you specify here needs SELECT, INSERT, UPDATE, and
# DELETE privileges on the 'programretention', 'hostretention',
# and 'serviceretention' tables in the database.
#xrddb_host=somehost
#xrddb_port=someport
#xrddb_database=nagios
#xrddb_username=nagios
#xrddb_password=password
#xrddb_optimize_data=1

# ## End of File ##
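Because resource.cfg uses a simple `$USERn$=value` format, it is easy to check which macros a given file really defines. A commented-out `#$USER2$` line, for example, defines nothing. Here is a minimal Python sketch of such a check, assuming one definition per line. `load_user_macros` is a hypothetical helper, and Nagios's own parser may treat edge cases differently.

```python
import re

# Matches $USER1$ through $USER32$ followed by "=value".
USER_MACRO = re.compile(r"^\$(USER([1-9]|[12][0-9]|3[0-2]))\$=(.*)$")


def load_user_macros(text: str) -> dict:
    """Return {'USER1': '/usr/local/nagios/libexec', ...} from resource.cfg text."""
    macros = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines, comments and commented-out macros are ignored
        match = USER_MACRO.match(line)
        if match:
            macros[match.group(1)] = match.group(3)
    return macros


if __name__ == "__main__":
    with open("/usr/local/nagios/etc/resource.cfg") as handle:
        print(load_user_macros(handle.read()))
```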
######################################################
# NOTIFICATION COMMANDS
######################################################

# 'host-notify-by-email' command definition
define command{
	command_name	host-notify-by-email
	command_line	/usr/bin/printf "%b" "***** Nagios  *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/bin/mail -s "Host $HOSTSTATE$ alert for $HOSTNAME$!" $CONTACTEMAIL$
	}

# 'host-notify-by-epager' command definition
define command{
	command_name	host-notify-by-epager
	command_line	/usr/bin/printf "%b" "Host '$HOSTALIAS$' is $HOSTSTATE$\nInfo: $HOSTOUTPUT$\nTime: $LONGDATETIME$" | /usr/bin/mail -s "$NOTIFICATIONTYPE$ alert - Host $HOSTNAME$ is $HOSTSTATE$" $CONTACTPAGER$
	}

# 'notify-by-email' command definition
define command{
	command_name	notify-by-email
	command_line	/usr/bin/printf "%b" "***** Nagios  *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ alert - $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
	}

# 'notify-by-epager' command definition
define command{
	command_name	notify-by-epager
	command_line	/usr/bin/printf "%b" "Service: $SERVICEDESC$\nHost: $HOSTNAME$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\nInfo: $SERVICEOUTPUT$\nDate: $LONGDATETIME$" | /usr/bin/mail -s "$NOTIFICATIONTYPE$: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$" $CONTACTPAGER$
	}

######################################################
# PERFORMANCE DATA COMMANDS
#
# These are sample performance data commands that can be used to
# send performance data output to two text files (one for hosts, another
# for services).  If you plan on simply writing performance data out to a
# file, consider using the host_perfdata_file and service_perfdata_file
# options in the main config file.
#
######################################################

# 'process-host-perfdata' command definition
define command{
	command_name	process-host-perfdata
	command_line	/usr/bin/printf "%b" "$LASTHOSTCHECK$\t$HOSTNAME$\t$HOSTSTATE$\t$HOSTATTEMPT$\t$HOSTSTATETYPE$\t$HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$\n" >> /usr/local/nagios/var/host-perfdata.out
	}

# 'process-service-perfdata' command definition
define command{
	command_name	process-service-perfdata
	command_line	/usr/bin/printf "%b" "$LASTSERVICECHECK$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICESTATE$\t$SERVICEATTEMPT$\t$SERVICESTATETYPE$\t$SERVICEEXECUTIONTIME$\t$SERVICELATENCY$\t$SERVICEOUTPUT$\t$SERVICEPERFDATA$\n" >> /usr/local/nagios/var/service-perfdata.out
	}

# ## End of File ##
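If nagios.cfg is set to use `process-service-perfdata`, each service check appends one tab-separated line to /usr/local/nagios/var/service-perfdata.out. The fields appear in the order of the `printf` format above. A minimal Python sketch for reading those lines back follows. The helper names and the sample line in the check are hypothetical, and the simple perfdata split assumes labels without spaces or quotes.

```python
# Field order copied from the printf format in 'process-service-perfdata'.
SERVICE_FIELDS = ("last_check", "host", "service", "state", "attempt",
                  "state_type", "execution_time", "latency", "output", "perfdata")


def parse_service_line(line: str) -> dict:
    """Split one line of service-perfdata.out into named fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(SERVICE_FIELDS):
        # Plugin output that itself contains a tab would shift the columns.
        raise ValueError(f"expected {len(SERVICE_FIELDS)} fields, got {len(values)}")
    record = dict(zip(SERVICE_FIELDS, values))
    record["last_check"] = int(record["last_check"])  # $LASTSERVICECHECK$ is a Unix timestamp
    return record


def perfdata_values(perfdata: str) -> dict:
    """Return {label: value} from 'label=value;warn;crit;min;max ...' (simple labels only)."""
    result = {}
    for item in perfdata.split():
        label, _, rest = item.partition("=")
        result[label] = rest.split(";")[0]
    return result


if __name__ == "__main__":
    with open("/usr/local/nagios/var/service-perfdata.out") as handle:
        for raw in handle:
            rec = parse_service_line(raw)
            print(rec["host"], rec["service"], perfdata_values(rec["perfdata"]))
```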

Monitoring services

  • In this document I will be referring to the following network:

[Image: Dummy network.png, diagram of the example network]

  • This simplified network will show how to monitor the common services. It is then easy to extend this to any number of hosts.

Keeping the config files neat and tidy

It is possible to keep all the configuration in one file. This works, but a single large file is a pain to debug and harder to change. Instead, we have already specified that the directory /usr/local/nagios/etc/conf.d will hold a number of .cfg files containing the configuration.

In this example, we will split the hosts into separate groups:

  • Servers
  • Printers
  • Switches
  • Misc

Each group can then be split into two config files containing info about:

  • Hosts
  • Services

The files for these will be called:

  • hostsServers.cfg
  • servicesServers.cfg
  • hostsPrinters.cfg
  • servicesPrinters.cfg
  • hostsSwitches.cfg
  • servicesSwitches.cfg
  • hostsMisc.cfg
  • servicesMisc.cfg

In addition to the eight files above we will also add the following files:

  • contacts.cfg
    • Who to inform
  • contactgroups.cfg
    • How the contacts are grouped
  • timeperiods.cfg
    • When to notify people
  • hostgroups.cfg
    • How the hosts are grouped
  • extinfo.cfg
    • Extended information (e.g. which images the CGIs should use for the different hosts)

By using templates we can also keep common syntax outside of the hosts and services files. This makes it quicker to add to the files and makes them considerably shorter!

  • hostsTemplates.cfg
  • servicesTemplates.cfg
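Assuming nagios.cfg contains `cfg_dir=/usr/local/nagios/etc/conf.d` as set up earlier, Nagios reads every file ending in `.cfg` in that directory and its subdirectories, and ignores everything else. The Python sketch below lists the files that would be picked up. This is a quick way to spot a file accidentally saved as `hostsServers.txt` or `contacts.cfg.bak`. The function name is hypothetical, and the sorting is only for readability.

```python
import os


def find_cfg_files(directory: str) -> list:
    """List every *.cfg file under directory, recursing into subdirectories."""
    found = []
    for root, dirs, files in os.walk(directory):
        dirs.sort()  # visit subdirectories in a predictable order
        for name in sorted(files):
            if name.endswith(".cfg"):
                found.append(os.path.join(root, name))
    return found


if __name__ == "__main__":
    for path in find_cfg_files("/usr/local/nagios/etc/conf.d"):
        print(path)
```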

Starting with the simple stuff

  • To keep things simple we will assume that there is one user that is logged onto the Nagios box as 'nagios' and that email is delivered locally.
  • Create the following files:
######################################################
# contacts.cfg - CONTACT DEFINITIONS
######################################################
# 'nagios' contact definition
define contact{
	contact_name			nagios
	alias					Nagios Admin
	service_notification_period	24x7
	host_notification_period	24x7
	service_notification_options	w,u,c,r
	host_notification_options	d,u,r
	service_notification_commands	notify-by-email
	host_notification_commands	host-notify-by-email
	email					nagios@localhost
	}

# ## End of File ##
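The `service_notification_options` and `host_notification_options` letters decide which events reach a contact. The small Python sketch below spells out the letters used in this document. Note that `u` means UNKNOWN for services but UNREACHABLE for hosts. The function name is hypothetical, and only these letters are covered.

```python
# Option letters as used in contacts.cfg above (other letters exist but are not covered).
SERVICE_OPTIONS = {"w": "WARNING", "u": "UNKNOWN", "c": "CRITICAL", "r": "RECOVERY"}
HOST_OPTIONS = {"d": "DOWN", "u": "UNREACHABLE", "r": "RECOVERY"}


def should_notify(options: str, event: str, letters: dict) -> bool:
    """True if a contact with these notification options hears about event."""
    wanted = {letters[code.strip()] for code in options.split(",")
              if code.strip() in letters}
    return event in wanted


if __name__ == "__main__":
    # host_notification_options d,u,r for the 'nagios' contact.
    print(should_notify("d,u,r", "UNREACHABLE", HOST_OPTIONS))
```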
######################################################
# contactgroups.cfg - CONTACT GROUP DEFINITIONS
######################################################
# 'windows-admins' contact group definition
define contactgroup{
	contactgroup_name	windows-admins
	alias			Windows Administrators
	members		nagios
	}

# 'linux-admins' contact group definition
define contactgroup{
	contactgroup_name	linux-admins
	alias			Linux Administrators
	members		nagios
	}

# 'switch-admins' contact group definition
define contactgroup{
	contactgroup_name	switch-admins
	alias			Switch Administrators
	members		nagios
	}
		
# 'printer-admins' contact group definition
define contactgroup{
	contactgroup_name	printer-admins
	alias			Printer Administrators
	members		nagios
	}

# 'misc-admins' contact group definition
define contactgroup{
	contactgroup_name	misc-admins
	alias			Misc Device Administrators
	members		nagios
	}

# ## End of file ##
  • It is not necessary to have multiple contact groups but it makes it easier to assign specific systems to different members of a support team.
######################################################
# timeperiods.cfg - TIMEPERIOD DEFINITIONS
######################################################
# '24x7' timeperiod definition
define timeperiod{
	timeperiod_name	24x7
	alias			24 Hours A Day, 7 Days A Week
	sunday		00:00-24:00
	monday		00:00-24:00
	tuesday		00:00-24:00
	wednesday		00:00-24:00
	thursday		00:00-24:00
	friday		00:00-24:00
	saturday		00:00-24:00
	}

# 'workhours' timeperiod definition. These are the hours when most
# staff are at work. If people turn their printers off overnight and
# this period starts before they arrive, you could end up receiving
# notifications for the 30 minutes or so before they get in the next day!
define timeperiod{
	timeperiod_name	workhours
	alias			"Normal" Working Hours
	monday		08:30-16:00
	tuesday		08:30-16:00
	wednesday		08:30-16:00
	thursday		08:30-16:00
	friday		08:30-16:00
	}

# 'nonworkhours' timeperiod definition
define timeperiod{
	timeperiod_name	nonworkhours
	alias			Non-Work Hours
	sunday		00:00-24:00
	monday		00:00-08:30,16:00-24:00
	tuesday		00:00-08:30,16:00-24:00
	wednesday		00:00-08:30,16:00-24:00
	thursday		00:00-08:30,16:00-24:00
	friday		00:00-08:30,16:00-24:00
	saturday		00:00-24:00
	}

# 'none' timeperiod definition
define timeperiod{
	timeperiod_name	none
	alias			No Time Is A Good Time
	}

# ## End of File ##
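A quick way to sanity-check a timeperiod is to ask whether a given moment falls inside it. The following Python sketch mirrors the `workhours` and `nonworkhours` definitions above. Days that are not listed (Saturday and Sunday in `workhours`) are excluded entirely. Treating the end of each range as exclusive is an assumption of this sketch rather than a description of Nagios internals, and the function names are hypothetical.

```python
from datetime import datetime

DAYS = ("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday")

# The 'workhours' and 'nonworkhours' periods from timeperiods.cfg.
WORKHOURS = {day: "08:30-16:00" for day in DAYS[:5]}
NONWORKHOURS = {day: "00:00-08:30,16:00-24:00" for day in DAYS[:5]}
NONWORKHOURS.update(saturday="00:00-24:00", sunday="00:00-24:00")


def _minutes(hhmm: str) -> int:
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)  # "24:00" -> 1440, the end of the day


def in_timeperiod(period: dict, when: datetime) -> bool:
    """True if 'when' falls inside one of the period's ranges for that weekday."""
    ranges = period.get(DAYS[when.weekday()])
    if not ranges:  # days missing from the definition are never included
        return False
    now = when.hour * 60 + when.minute
    for part in ranges.split(","):
        start, end = part.split("-")
        if _minutes(start) <= now < _minutes(end):  # end treated as exclusive (assumption)
            return True
    return False
```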

Configuring the checks

  • Firstly we need to create the template files which will contain the common options of the host and service definitions:
##### hostsTemplates.cfg             #####
##### Templates for Host Definitions #####
define host{
	name				generic-host ; The name of this host template - referenced in other
                                                     ; host definitions, used for template recursion/resolution
	notifications_enabled		1            ; Host notifications are enabled
	event_handler_enabled		0            ; Host event handler is disabled
	flap_detection_enabled		0            ; Flap detection is disabled
	process_perf_data		1            ; Process performance data
	retain_status_information	1            ; Retain status information across program restarts
	retain_nonstatus_information	1            ; Retain non-status information across program restarts
        check_command                   check-host-alive    ; Default host check is a quick ping
        max_check_attempts	        10
	notification_interval	        120
	notification_period	        24x7
	notification_options	        d,u,r
	register			0            ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, 
                                                     ; JUST A TEMPLATE!
	}
    
define host{
    name				windows-server	; Template for Windows Servers
    use				generic-host	; Use above template for defaults
    contact_groups	      windows-admins	; Who to notify
    register		      0
    }
    
define host{
    name				linux-server	; Template for Linux Servers
    use				generic-host
    contact_groups		linux-admins
    register			0
    }

define host{
    name				switch-template  ; Template for Managed Switches
    use				generic-host
    contact_groups		switch-admins
    register			0
    }
    
define host{
    name                     printer-template  ; Template for Printers
    use                      generic-host
    notification_period	     workhours
    contact_groups           printer-admins
    register	           0
    }

define host{
    name                     misc-device  ; Template for Misc Network Devices
    use                      generic-host
    contact_groups           misc-admins
    register			0
    }

# ## End of file ##
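Each host definition inherits every variable it does not set itself from the template named in its `use` line, and templates can themselves `use` other templates. The Python sketch below imitates that lookup for the printer chain above, where LASER-PRINTER uses printer-template, which uses generic-host. It is a simplified model with a hypothetical `resolve` function. It follows a single `use` name per object, as in the files above. It also does not pass on `name` or `register`, which is why the real hosts are registered even though every template has `register 0`.

```python
# Variables that a template does not pass on to objects that use it.
NOT_INHERITED = {"name", "register"}

# A subset of the definitions above, keyed by template name or host_name.
OBJECTS = {
    "generic-host": {"name": "generic-host", "check_command": "check-host-alive",
                     "max_check_attempts": "10", "notification_period": "24x7",
                     "notification_options": "d,u,r", "register": "0"},
    "printer-template": {"name": "printer-template", "use": "generic-host",
                         "notification_period": "workhours",
                         "contact_groups": "printer-admins", "register": "0"},
    "linux-server": {"name": "linux-server", "use": "generic-host",
                     "contact_groups": "linux-admins", "register": "0"},
    "LASER-PRINTER": {"use": "printer-template", "host_name": "LASER-PRINTER",
                      "address": "192.168.0.4"},
    "LINUX-BOX": {"use": "linux-server", "host_name": "LINUX-BOX",
                  "address": "192.168.0.2"},
}


def resolve(key: str, objects: dict, seen: frozenset = frozenset()) -> dict:
    """Merge an object with the chain of templates named by 'use'."""
    if key in seen:
        raise ValueError(f"template loop at {key}")
    own = dict(objects[key])
    parent = own.pop("use", None)
    if parent is None:
        return own
    inherited = resolve(parent, objects, seen | {key})
    merged = {k: v for k, v in inherited.items() if k not in NOT_INHERITED}
    merged.update(own)  # values set locally override the template's
    return merged
```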
##### servicesTemplates.cfg        #####
##### Service Definition Templates #####
##### Generic service definition template #####
define service{
	name				generic-service
	active_checks_enabled	1	; Active service checks are enabled
	passive_checks_enabled	0	; Passive service checks are disabled
	parallelize_check		1	; Active service checks should be parallelized
					      ; (disabling this can lead to major performance problems)
	obsess_over_service	1	; We should obsess over this service (if necessary)
	check_freshness		0	; Default is to NOT check service 'freshness'
	notifications_enabled	1	; Service notifications are enabled
	event_handler_enabled	0	; Service event handler is disabled
	flap_detection_enabled	0	; Flap detection is disabled
	process_perf_data		1	; Process performance data
	retain_status_information	1	; Retain status information across program restarts
	retain_nonstatus_information	1	; Retain non-status information across program restarts
    check_period			24x7
	max_check_attempts	3
	normal_check_interval	3
	retry_check_interval	1
	notification_interval	120
	notification_period	24x7
	notification_options	w,u,c,r
	register			0
	}

##### General Service Definition Templates #####
define service{
    name				ping-service
    use				generic-service
    service_description		PING
    is_volatile			0
    check_command			check_ping!100.0,20%!500.0,60%
    register			0
    }
    
define service{
    name                     dns-service
    use                      generic-service
    service_description		DNS
    is_volatile			0
    check_command			rmc_check_dns!www.google.co.uk!1!2
    register			0
    }

define service{
    name				proxy-service
    use				generic-service
    service_description		PROXY
    is_volatile			0
    check_command			check_squid!8080!http://www.google.co.uk
    register			0
    }
    
define service{
    name				http-service
    use				generic-service
    service_description		HTTP
    is_volatile			0
    check_command			check_http
    register			0
    }

##### Printer Checks #####   
define service{
    name					printer-status
    use					generic-service
    service_description			Printer Status
    is_volatile				0
    check_period				workhours
    max_check_attempts			4
    normal_check_interval		5
    retry_check_interval		1
    contact_groups			printer-admins
    notification_interval		960
    notification_period			workhours
    check_command				check_hpjd
    register				0
    }

# ## End of File ##
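The `max_check_attempts`, `normal_check_interval` and `retry_check_interval` values together decide how long a failure can take to become a HARD state, which is the point at which notifications go out. The Python sketch below does the arithmetic. It assumes the default `interval_length=60` in nagios.cfg, so intervals are in minutes. It also ignores check execution time, scheduling latency and check periods. The function name is hypothetical.

```python
def seconds_to_hard_state(max_check_attempts: int, normal_check_interval: int,
                          retry_check_interval: int, interval_length: int = 60):
    """Return (best, worst) seconds from a failure starting to the HARD state."""
    # After the first failed check, the remaining attempts run at the retry interval.
    retries = (max_check_attempts - 1) * retry_check_interval * interval_length
    best = retries  # the failure starts just before a scheduled check
    worst = normal_check_interval * interval_length + retries  # ...or just after one
    return best, worst


if __name__ == "__main__":
    print("generic-service:", seconds_to_hard_state(3, 3, 1))
    print("printer-status:", seconds_to_hard_state(4, 5, 1))
```

For the generic service template this gives 2 to 5 minutes. For `printer-status` it gives 3 to 8 minutes, and then only within `workhours`.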
#######################################################################
# checkcommands.cfg - Nagios configuration file for local user changes
#######################################################################
##### Adapted DNS server check #####
define command{
	command_name	rmc_check_dns
	command_line	$USER1$/check_dns -s $HOSTADDRESS$ -H $ARG1$ -w $ARG2$ -c $ARG3$
	}

##### Check JetDirect Status #####
define command{
	command_name	check_hpjd
	command_line	$USER1$/check_hpjd -H $HOSTADDRESS$
	}

##### Ping-based checks #####
define command{
	command_name	check-host-alive
	command_line	$USER1$/check_ping -H $HOSTADDRESS$ -w 99,99% -c 100,100% -p 1
	}

define command{
	command_name	check_ping
	command_line	$USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
	}

##### HTTP-based checks #####
define command{
	command_name	check_http
	command_line 	$USER1$/check_http -H $HOSTADDRESS$
	}

define command{
	command_name	check_squid
	command_line	$USER1$/check_http -H $HOSTADDRESS$ -p $ARG1$ -u $ARG2$
	}

# ## End of file ##
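When a service says `check_command rmc_check_dns!www.google.co.uk!1!2`, the text before the first `!` names the command and the remaining pieces become `$ARG1$`, `$ARG2$` and so on. The Python sketch below performs that substitution for two of the commands above so you can see the exact command line. The `expand_check` helper is hypothetical and only knows the handful of macros it is given. Running the resulting line by hand as the nagios user is a handy way to debug a check that misbehaves.

```python
import re

MACRO = re.compile(r"\$([A-Z0-9]+)\$")

# Command definitions copied from checkcommands.cfg.
COMMANDS = {
    "rmc_check_dns": "$USER1$/check_dns -s $HOSTADDRESS$ -H $ARG1$ -w $ARG2$ -c $ARG3$",
    "check_ping": "$USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5",
}


def expand_check(check_command: str, host_address: str, user1: str) -> str:
    """Turn 'name!arg1!arg2...' into the command line that would be run (sketch)."""
    name, *args = check_command.split("!")
    values = {"USER1": user1, "HOSTADDRESS": host_address}
    values.update({f"ARG{i}": arg for i, arg in enumerate(args, start=1)})
    # Single pass; unknown macros are left untouched so they are easy to spot.
    return MACRO.sub(lambda m: values.get(m.group(1), m.group(0)), COMMANDS[name])


if __name__ == "__main__":
    print(expand_check("rmc_check_dns!www.google.co.uk!1!2",
                       "192.168.0.2", "/usr/local/nagios/libexec"))
```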
##### hostgroups.cfg       #####
##### Hostgroup Definitions #####
## This file groups the hosts to allow easy reference to similar hosts later
# 'windows-servers' host group definition
define hostgroup{
	hostgroup_name	windows-servers
	alias			Windows Servers
	members		WINDOW-BOX
	}

# 'linux-boxes' host group definition
define hostgroup{
	hostgroup_name	linux-boxes
	alias			Linux Servers
	members		LINUX-BOX,GATEWAY,NAGIOS
	}

# 'printers' host group definition
define hostgroup{
	hostgroup_name	printers
	alias			Printers
	members		LASER-PRINTER
	}

# 'switches' host group definition
define hostgroup{
	hostgroup_name	switches
	alias			Switches
	members		switch1,switch2,switch3
	}

# 'misc-devices' host group definition
define hostgroup{
	hostgroup_name	misc-devices
	alias			Misc Devices
	members		UPS,ROUTER,google
	}

# ## End of File ##
##### hostsServers.cfg        #####
##### Server Host Definitions #####
# 'WINDOW-BOX' host definition
define host{
	use			windows-server
	host_name		WINDOW-BOX
	alias			WINDOW-BOX (Windows Server)
	address		192.168.0.3
	parents		switch1
	}

# 'LINUX-BOX' host definition
define host{
	use			linux-server
	host_name		LINUX-BOX
	alias			LINUX-BOX (Linux Server)
	address		192.168.0.2
	parents		switch1
	}

# 'GATEWAY' host definition
define host{
	use			linux-server
	host_name		GATEWAY
	alias			GATEWAY (Linux Server - Proxy)
	address		192.168.0.1
	parents		switch1
	}

# 'NAGIOS' host definition
define host{
	use			linux-server
	host_name		NAGIOS
	alias			NAGIOS (Linux Server - Nagios)
	address		192.168.0.5
	parents		switch2
	}

# ## End of File ##
##### hostsSwitches.cfg       #####
##### Switch Host Definitions #####
# 'switch1' host definition
define host{
	use			switch-template
	host_name		switch1
	alias			Switch #1
	address		192.168.1.1
	parents		switch2
	}

# 'switch2' host definition
define host{
	use			switch-template
	host_name		switch2
	alias			Switch #2
	address		192.168.1.2
	}

# 'switch3' host definition
define host{
	use			switch-template
	host_name		switch3
	alias			Switch #3
	address		192.168.1.3
	parents		switch2
	}

# ## End of File ##
##### hostsPrinters.cfg         #####
##### Printers Host Definitions #####
# 'LASER-PRINTER' host definition
define host{
	use			printer-template
	host_name		LASER-PRINTER
	alias			LASER-PRINTER (HP LaserJet)
	address		192.168.0.4
	parents		switch2
	}

# ## End of File ##
##### hostsMisc.cfg         #####
##### Misc Host Definitions #####
# 'UPS' host definition
define host{
	use			misc-device
	host_name		UPS
	alias			UPS (APC SNMP Management Card)
	address		192.168.0.6
	parents		switch1
	}

# 'ROUTER' host definition
define host{
	use			misc-device
	host_name		ROUTER
	alias			ROUTER
	address		10.0.0.1
	parents		GATEWAY
	}

# 'Google UK' host definition
define host{
	use			misc-device
	host_name		google
	alias			Google UK
	address		www.google.co.uk	; DNS MUST WORK!!!
	parents		ROUTER
	}

# ## End of File ##
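The `parents` lines let Nagios distinguish a host that is DOWN from one that is merely UNREACHABLE because something between it and the Nagios box has failed. The Python sketch below applies a simplified version of that rule to the example network: a failed host whose parent has also failed is UNREACHABLE. The `classify_failures` function is hypothetical. Nagios's real logic also handles hosts with several parents, so treat this only as a model.

```python
# host -> parent, copied from the hosts*.cfg files (None = no parent defined).
PARENTS = {
    "switch2": None, "switch1": "switch2", "switch3": "switch2",
    "NAGIOS": "switch2", "LASER-PRINTER": "switch2",
    "WINDOW-BOX": "switch1", "LINUX-BOX": "switch1", "GATEWAY": "switch1",
    "UPS": "switch1", "ROUTER": "GATEWAY", "google": "ROUTER",
}


def classify_failures(failed: set, parents: dict = PARENTS) -> dict:
    """Label each host whose check failed as DOWN or UNREACHABLE."""
    states = {}
    for host in failed:
        # Simplified rule for single-parent hosts: if the parent also failed,
        # the host cannot be reached, so we cannot say it is really down.
        states[host] = "UNREACHABLE" if parents.get(host) in failed else "DOWN"
    return states


if __name__ == "__main__":
    # If switch1 dies, every host behind it fails its check too.
    print(classify_failures({"switch1", "WINDOW-BOX", "GATEWAY", "ROUTER"}))
```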
  • Now that all the hosts are defined, it is necessary to define the services to be checked:
##### servicesServers.cfg        #####
##### Server Service Definitions #####
# Ping all servers
define service{
    hostgroup_name          windows-servers
    use                     ping-service
    contact_groups          windows-admins
    }
    
define service{
    hostgroup_name          linux-boxes
    use                     ping-service
    contact_groups          linux-admins
    }

# Check DNS
define service{
	host_name			LINUX-BOX
	use				dns-service
	contact_groups		linux-admins
	}

# Check Web Server
define service{
	host_name			LINUX-BOX
	use				http-service
	contact_groups		linux-admins
	}

# Check Proxy Server
define service{
	host_name			GATEWAY
	use				proxy-service
	contact_groups		linux-admins
	}

# ## End of file ##
##### servicesSwitches.cfg       #####
##### Switch Service Definitions #####
# For now we just ping the switches; no specific services are monitored yet
# Ping all switches
define service{
    hostgroup_name          switches
    use                     ping-service
    contact_groups          switch-admins
    }

# ## End of File ##
##### servicesPrinters.cfg       #####
##### Printer Service Definitions #####
# Printers using 'proper' JetDirect cards
define service{
    hostgroup_name	printers
    use			printer-status
    }

# ## End of File ##
##### servicesMisc.cfg                 #####
##### Misc Devices Service Definitions #####
# Ping devices – all we will do for now ;-)
define service{
    hostgroup_name	misc-devices
    use			ping-service
    contact_groups	misc-admins
    }

# ## End of File ##
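Each service definition that names a `hostgroup_name` is applied to every member of that group, so a few short definitions produce quite a few checks. This Python sketch expands the definitions from the four services files using the hostgroups defined earlier. The data is copied from this document, and the `expand_services` function is hypothetical.

```python
# Hostgroup members from hostgroups.cfg.
HOSTGROUPS = {
    "windows-servers": ["WINDOW-BOX"],
    "linux-boxes": ["LINUX-BOX", "GATEWAY", "NAGIOS"],
    "printers": ["LASER-PRINTER"],
    "switches": ["switch1", "switch2", "switch3"],
    "misc-devices": ["UPS", "ROUTER", "google"],
}

# service_description set by each template in servicesTemplates.cfg.
DESCRIPTIONS = {"ping-service": "PING", "dns-service": "DNS", "http-service": "HTTP",
                "proxy-service": "PROXY", "printer-status": "Printer Status"}

# The service definitions from the services*.cfg files.
SERVICE_DEFS = [
    {"hostgroup_name": "windows-servers", "use": "ping-service"},
    {"hostgroup_name": "linux-boxes", "use": "ping-service"},
    {"host_name": "LINUX-BOX", "use": "dns-service"},
    {"host_name": "LINUX-BOX", "use": "http-service"},
    {"host_name": "GATEWAY", "use": "proxy-service"},
    {"hostgroup_name": "switches", "use": "ping-service"},
    {"hostgroup_name": "printers", "use": "printer-status"},
    {"hostgroup_name": "misc-devices", "use": "ping-service"},
]


def expand_services(defs, hostgroups, descriptions):
    """Return one (host, service description) pair per service check."""
    checks = []
    for definition in defs:
        if "hostgroup_name" in definition:
            hosts = hostgroups[definition["hostgroup_name"]]
        else:
            hosts = [definition["host_name"]]
        checks.extend((host, descriptions[definition["use"]]) for host in hosts)
    return checks


if __name__ == "__main__":
    for host, service in expand_services(SERVICE_DEFS, HOSTGROUPS, DESCRIPTIONS):
        print(f"{host:15} {service}")
```

Notice that the printer gets only the Printer Status check and no PING service, because servicesPrinters.cfg does not assign `ping-service` to the `printers` group.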
##### extinfo.cfg (you will need to get the necessary icon packs from nagiosexchange.org)
##### Extended Host and Service Information #####
define hostextinfo{
        hostgroup_name   linux-boxes
        notes            Debian GNU/Linux servers
        icon_image       rack_linux.png
        icon_image_alt   Debian GNU/Linux
        vrml_image       rack_linux.png
        statusmap_image  rack_linux.png
        }

define hostextinfo{
        hostgroup_name   windows-servers
        notes            Windows servers
        icon_image       rack_windows.png
        icon_image_alt   Windows Server 2003
        vrml_image       rack_windows.png
        statusmap_image  rack_windows.png
        }

define hostextinfo{
        hostgroup_name   printers
        notes            Network Printers
        icon_image       hp-printer40.png
        icon_image_alt   Network Printer
        vrml_image       hp-printer40.png
        statusmap_image  hp-printer40.gd2
        }

define hostextinfo{
        hostgroup_name   switches
        notes            Switches
        icon_image       switch.png
        icon_image_alt   Switch
        vrml_image       switch.png
        statusmap_image  switch.png
        }

define hostextinfo{
	hostgroup_name	misc-devices
	notes		Misc. Devices
	icon_image	black_box.png
	icon_image_alt	Misc. device
	vrml_image	black_box.png
	statusmap_image	black_box.png
	}

# ## End of File ##

Checking that the config files make sense and starting Nagios

  • Before loading the Nagios daemon, a sanity check should be performed on the configuration files:

# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

  • If there are any error messages, read the output to see what is wrong – it's usually typos ;-)
  • If the sanity check reports no errors, (re)start the daemon using the command:

# /etc/init.d/nagios restart
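The check-then-restart routine above can be scripted so that the daemon is only restarted when the configuration verifies cleanly. This is a minimal Python sketch, not part of Nagios itself. It assumes the paths used in this guide and that `nagios -v` exits with a non-zero status when it finds errors. The `run` parameter is a hypothetical hook that lets the logic be tried out without a real Nagios install.

```python
import subprocess

# Paths as used elsewhere in this guide; adjust for your installation.
NAGIOS_BIN = "/usr/local/nagios/bin/nagios"
NAGIOS_CFG = "/usr/local/nagios/etc/nagios.cfg"
INIT_SCRIPT = "/etc/init.d/nagios"


def verify_then_restart(run=subprocess.run):
    """Run the sanity check and restart Nagios only if it passes.

    Assumes `nagios -v` returns a non-zero exit status on configuration
    errors. `run` defaults to subprocess.run and can be swapped for a fake.
    """
    check = run([NAGIOS_BIN, "-v", NAGIOS_CFG], capture_output=True, text=True)
    if check.returncode != 0:
        # Show the verifier's report so the typo can be found and fixed.
        print(check.stdout)
        return False
    # Only reached when the configuration verified cleanly.
    run([INIT_SCRIPT, "restart"], check=True)
    return True
```

Called with no arguments as root on the Nagios server, it runs the same two commands shown above, in the same order, and returns False without touching the running daemon if the check fails.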

Extending the Configuration

Monitoring Switches

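For this section, I will assume that you are using managed HP Procurve switches – this is what I have experience of. The checks are performed using SNMP (Simple Network Management Protocol), which lets Nagios read values that the switch publishes about its own workings, each identified by an object identifier (OID). Many vendors implement the same standard MIBs (RFCs), so the general approach should carry over to kit such as Cisco and 3Com. However, the OIDs used below sit under HP's private enterprise subtree (.1.3.6.1.4.1.11), so on other vendors' switches you will need to find the equivalent OIDs.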

I will refer back to the example network above:

  • Switch #1 – HP Procurve 2828
  • Switch #2 – HP Procurve 4104gl (single PSU)
  • Switch #3 – HP Procurve 2650

I have chosen these three switches because of the slight variations in their hardware (e.g. the 2828 has more memory, the 4104gl has the option of a redundant PSU, etc.). If you have different switches, you can inspect the contents of the OIDs using snmpwalk and find suitable threshold values by a little trial and error.
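When trying out OIDs with snmpwalk (for example `snmpwalk -v 1 -c MyReadCommunity -On switch1 .1.3.6.1.4.1.11.2.14.11.1.2.6`), it helps to have the output in a form you can compare against candidate thresholds. The Python sketch below parses lines in the numeric-OID format that Net-SNMP prints with `-On`, such as `.1.3.6.1.4.1.11.2.14.11.1.2.6.1.4.1 = INTEGER: 4`. That exact format, and the sample values in the usage below, are assumptions for illustration rather than output captured from a real switch.

```python
import re

# HP's private enterprise subtree; the Procurve OIDs in this guide sit under it.
HP_ENTERPRISE = ".1.3.6.1.4.1.11"

# One 'OID = TYPE: value' line, as printed with numeric OIDs (-On).
_LINE = re.compile(r"^(?P<oid>\.[\d.]+)\s*=\s*(?P<type>[A-Za-z0-9 -]+):\s*(?P<value>.*)$")


def parse_snmpwalk(text):
    """Map each numeric OID to its value, converting INTEGER values to int.

    Handles both plain integers ('4') and MIB enum labels ('good(4)').
    Lines that do not match the expected form are skipped.
    """
    result = {}
    for line in text.splitlines():
        m = _LINE.match(line.strip())
        if not m:
            continue
        value = m.group("value").strip()
        if m.group("type") == "INTEGER":
            enum = re.search(r"\((-?\d+)\)$", value)
            value = int(enum.group(1) if enum else value)
        result[m.group("oid")] = value
    return result


def under_subtree(oid, prefix=HP_ENTERPRISE):
    """True if `oid` is `prefix` itself or lies beneath it in the OID tree."""
    return oid == prefix or oid.startswith(prefix + ".")
```

The prefix check compares whole arcs (note the trailing dot), so `.1.3.6.1.4.1.110...` is correctly not treated as part of HP's subtree.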

Warning! More configuration file changes ahead:

  • When adding the lines of config below, DO NOT include the ellipses (...) – these simply indicate that the new lines go after the existing contents of the file!
  • First we need to add some custom commands to the end of the checkcommands.cfg file:
...
##### Checks for HP Procurve Switches
define command{
        command_name    rmc_check_hpmemoryfree
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o .1.3.6.1.4.1.11.2.14.11.5.1.1.2.1.1.1.6.1 -t 5 -w $ARG2$ -c $ARG3$ -u bytes -l free
        }
define command{
        command_name    rmc_check_hp_cpu
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o .1.3.6.1.4.1.11.2.14.11.5.1.9.6.1.0 -t 5  -w $ARG2$ -c $ARG3$ -u % -l "5min cpu"
        }
define command{
        command_name    rmc_check_hpfan
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o .1.3.6.1.4.1.11.2.14.11.1.2.6.1.4.1 -w $ARG2$ -c $ARG3$ -l 'Fan status'
        }
define command{
        command_name    rmc_check_hppower
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o .1.3.6.1.4.1.11.2.14.11.1.2.6.1.4.2 -w $ARG2$ -c $ARG3$ -l 'Power Supply status'
        }
define command{
        command_name    rmc_check_hptemp
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o .1.3.6.1.4.1.11.2.14.11.1.2.6.1.4.4 -w $ARG2$ -c $ARG3$ -l 'Temperature status'
        }
# For some reason the default slot for the 410x power supply is slot 2 :?
define command{
        command_name    rmc_check_hppower_4100
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o .1.3.6.1.4.1.11.2.14.11.1.2.6.1.4.3 -w $ARG2$ -c $ARG3$ -l 'Power Supply status'
        }
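Each `check_command` in a service definition is the command name followed by `!`-separated arguments. Nagios substitutes those arguments into `$ARG1$`, `$ARG2$`, ... of the matching `command_line`, alongside macros such as `$USER1$` (normally set in resource.cfg to the plugin directory) and `$HOSTADDRESS$`. Below is a simplified Python sketch of that substitution, for illustration only. The real Nagios macro engine supports many more macros, and the `$USER1$` value and the host address used below are assumptions.

```python
import re

# One command copied from the definitions above.
COMMANDS = {
    "rmc_check_hpfan": "$USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ "
                       "-o .1.3.6.1.4.1.11.2.14.11.1.2.6.1.4.1 -w $ARG2$ -c $ARG3$ -l 'Fan status'",
}


def expand_check(check_command, macros, commands=COMMANDS):
    """Turn 'name!arg1!arg2...' into the command line Nagios would run.

    Arguments become $ARG1$, $ARG2$, ...; any other $MACRO$ is looked up
    in `macros`. Unknown macros are left untouched in this sketch.
    """
    name, *args = check_command.split("!")
    values = dict(macros)
    for i, arg in enumerate(args, start=1):
        values[f"ARG{i}"] = arg
    return re.sub(r"\$(\w+)\$", lambda m: values.get(m.group(1), m.group(0)), commands[name])
```

For example, `expand_check("rmc_check_hpfan!MyReadCommunity!4!3:5", {"USER1": "/usr/local/nagios/libexec", "HOSTADDRESS": "192.0.2.10"})` yields the full check_snmp command line with the community string in place of `$ARG1$`. It also shows why a stray character in the `check_command` (such as a `$` after the community string) ends up verbatim in the command that is run.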
  • We also need to add some service templates to the end of the servicesTemplate.cfg file (obviously, replace 'MyReadCommunity' with your SNMP read community string):
...
##### Switch Service Definition Templates #####
define service{
    name                            switch-memory2800-service
    use                             generic-service
    service_description             MEMORY
    is_volatile                     0
    contact_groups                  switch-admins 
    check_command                           rmc_check_hpmemoryfree!MyReadCommunity!19000000:10000000!10000000:0 
    register                        0
    }
    
define service{
    name                            switch-memory-service
    use                             generic-service
    service_description             MEMORY
    is_volatile                     0
    contact_groups                  switch-admins 
    check_command                           rmc_check_hpmemoryfree!MyReadCommunity!2000:19000000!1000:19000000 
    register                        0
    }

define service{
    name                            switch-CPU-service
    use                             generic-service
    service_description             CPU
    is_volatile                     0
    contact_groups                  switch-admins 
    check_command                   rmc_check_hp_cpu!MyReadCommunity!95:90!100:95
    register                        0
    }
    
define service{
    name                            switch-PSU-service
    use                             generic-service
    service_description             PSU
    is_volatile                     0
    contact_groups                  switch-admins 
    check_command                   rmc_check_hppower!MyReadCommunity!4!3:5 
    register                        0
    }

define service{
    name                            switch-PSU4100-service
    use                             generic-service
    service_description             PSU
    is_volatile                     0
    contact_groups                  switch-admins 
    check_command                   rmc_check_hppower_4100!MyReadCommunity!4!3:5
    register                        0
    }    

define service{
    name                            switch-temp-service
    use                             generic-service
    service_description             TEMP
    is_volatile                     0
    contact_groups                  switch-admins 
    check_command                   rmc_check_hptemp!MyReadCommunity!4!3:5
    register                        0
    }
 
define service{
    name                            switch-fan-service
    use                             generic-service
    service_description             FAN
    is_volatile                     0
    contact_groups                  switch-admins 
    check_command                   rmc_check_hpfan!MyReadCommunity!4!3:5
    register                        0
    }
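The `-w` and `-c` values passed through to check_snmp are threshold ranges. The Nagios plugin development guidelines define the format as follows: `10` alerts outside 0 to 10, `10:` alerts below 10, `~:10` alerts above 10, `10:20` alerts outside 10 to 20, and a leading `@` inverts the test so that it alerts inside the range. The Python sketch below implements those rules so you can try values before putting them in a template. Be aware that the check_snmp shipped with older plugin releases (such as 1.4.8) parsed its thresholds itself and may treat ranges written high-to-low, like some of the values above, differently. Test against your own plugin version.

```python
import math


def parse_range(spec):
    """Parse a Nagios threshold range into (start, end, inside).

    Rules from the plugin development guidelines:
      'N'    -> range 0..N, alert if value < 0 or > N
      'N:'   -> range N..infinity, alert if value < N
      '~:N'  -> range -infinity..N, alert if value > N
      'M:N'  -> range M..N, alert if value is outside it
      '@M:N' -> alert if value is inside M..N (inclusive)
    """
    inside = spec.startswith("@")
    if inside:
        spec = spec[1:]
    if ":" in spec:
        low, high = spec.split(":", 1)
        start = -math.inf if low == "~" else float(low or 0)
        end = math.inf if high == "" else float(high)
    else:
        start, end = 0.0, float(spec)
    if start > end:
        # The guidelines require start <= end; older check_snmp releases
        # may treat reversed ranges in their own way.
        raise ValueError(f"start > end in threshold {spec!r}")
    return start, end, inside


def should_alert(spec, value):
    """True if `value` breaches the threshold described by `spec`."""
    start, end, inside = parse_range(spec)
    in_range = start <= value <= end
    return in_range if inside else not in_range


def status(value, warn, crit):
    """Return the plugin state name; critical is checked before warning."""
    if should_alert(crit, value):
        return "CRITICAL"
    if should_alert(warn, value):
        return "WARNING"
    return "OK"
```

Under these rules, the PSU template's `-w 4 -c 3:5` gives OK for a reading of 4, WARNING for 5, and CRITICAL for 1 or 2, while a high-to-low range such as `95:90` is rejected as invalid.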
  • Now we need to assign the services to the switches in servicesSwitches.cfg:
...
# Check memory
define service{
    host_name           switch1
    use                 switch-memory2800-service
    }

define service{
    host_name           switch2,switch3
    use                 switch-memory-service
    }
    
# Check fan
define service{
    hostgroup_name      switches
    use                 switch-fan-service
    }

# Check CPU
define service{
    hostgroup_name      switches
    use                 switch-CPU-service
    }
    
# Check PSU
define service{
    host_name           switch1,switch3
    use                 switch-PSU-service
    }

define service{
    host_name           switch2
    use                 switch-PSU4100-service
    }
    
# Check temperature
#    4100-series switches do not appear to support the temperature check
define service{
    host_name           switch1,switch3
    use                 switch-temp-service
    }
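Two mechanisms are at work in the switch definitions above. First, each service inherits any directive it does not set itself from the template named in `use`; the templates carry `register 0` so that Nagios does not treat them as real services, and `register` itself is not inherited. Second, a `host_name` list or a `hostgroup_name` creates one copy of the service per host. The Python sketch below shows how the PSU definitions resolve. The dictionaries stand in for the config files, and the contents of `generic-service` are assumed. Real Nagios handles far more (multiple templates, exclusions, and so on), which is omitted here.

```python
# Stand-ins for servicesTemplate.cfg and servicesSwitches.cfg.
# The contents of generic-service are assumed for illustration.
TEMPLATES = {
    "generic-service": {"check_period": "24x7", "register": "0"},
    "switch-PSU-service": {"use": "generic-service", "service_description": "PSU",
                           "check_command": "rmc_check_hppower!MyReadCommunity!4!3:5",
                           "register": "0"},
    "switch-PSU4100-service": {"use": "generic-service", "service_description": "PSU",
                               "check_command": "rmc_check_hppower_4100!MyReadCommunity!4!3:5",
                               "register": "0"},
}
HOSTGROUPS = {"switches": ["switch1", "switch2", "switch3"]}
SERVICES = [
    {"host_name": "switch1,switch3", "use": "switch-PSU-service"},
    {"host_name": "switch2", "use": "switch-PSU4100-service"},
]


def resolve(definition, templates=TEMPLATES):
    """Merge a definition with its template chain; local directives win."""
    chain = []
    current = definition
    while current is not None:
        chain.append(current)
        current = templates.get(current["use"]) if "use" in current else None
    merged = {}
    for layer in reversed(chain):  # most generic first, so specific values override
        merged.update(layer)
    merged.pop("use", None)
    merged["register"] = definition.get("register", "1")  # register is not inherited
    return merged


def expand(services=SERVICES, hostgroups=HOSTGROUPS):
    """Yield (host, resolved service) pairs: one service per listed host."""
    for svc in services:
        hosts = []
        if "host_name" in svc:
            hosts += [h.strip() for h in svc["host_name"].split(",")]
        for group in svc.get("hostgroup_name", "").split(","):
            if group.strip():
                hosts += hostgroups[group.strip()]
        resolved = resolve(svc)
        for host in hosts:
            yield host, resolved
```

Running `expand()` gives switch1 and switch3 the `rmc_check_hppower` check and switch2 the `rmc_check_hppower_4100` check, all with the description PSU.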

A more advanced UPS check

  • Thanks to a friendly programmer on Nagios Exchange, I found a UPS check that queries the SNMP details from the management card and tells me if the temperature gets silly or the batteries die. You will need to download it from http://www.nagiosexchange.org :-)
  • Edit the checkcommands.cfg file:
...
#####  Check APC UPS Status #####
define command{
	command_name	rmc_check_snmp_apcups
	command_line	/usr/lib/nagios/plugins/check_snmp_apcups -H $HOSTADDRESS$ -C $ARG1$
	}
  • Edit the servicesTemplate.cfg file:
...
# UPS Check
define service{
    name                            UPS-check-service
    use                             generic-service
    service_description             UPS
    is_volatile                     0
    contact_groups                  misc-admins 
    check_command                   rmc_check_snmp_apcups!MyReadCommunity
    register                        0
    }
  • Edit the servicesMisc.cfg file:
...
# Check UPS Status
define service{
    host_name           UPS
    use                 UPS-check-service
    }
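Plugins such as check_snmp_apcups report back to Nagios through a simple contract: one line of text on standard output (optionally followed by `|` and performance data in the form `label=value;warn;crit`) and an exit code of 0 for OK, 1 for WARNING, 2 for CRITICAL or 3 for UNKNOWN. The Python sketch below is a hypothetical skeleton that shows that contract. It is not the downloaded plugin. The readings are passed in as arguments rather than fetched over SNMP, and the temperature thresholds are made up for illustration.

```python
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def ups_result(temp_c, battery_ok, warn_temp=35, crit_temp=40):
    """Return (exit_code, output_line) for hypothetical UPS readings.

    temp_c: battery temperature in Celsius, or None if it could not be read.
    battery_ok: False when the management card says the battery needs replacing.
    The threshold defaults are illustrative, not taken from any real plugin.
    """
    if temp_c is None:
        return UNKNOWN, "UPS UNKNOWN - no temperature reading"
    if not battery_ok or temp_c >= crit_temp:
        code = CRITICAL
    elif temp_c >= warn_temp:
        code = WARNING
    else:
        code = OK
    battery = "battery OK" if battery_ok else "battery needs replacing"
    perfdata = f"temperature={temp_c};{warn_temp};{crit_temp}"
    return code, f"UPS {STATE_NAMES[code]} - {temp_c}C, {battery} | {perfdata}"


def main(temp_c, battery_ok):
    """Print the status line and exit with the matching code, as Nagios expects."""
    code, line = ups_result(temp_c, battery_ok)
    print(line)
    sys.exit(code)
```

Nagios uses only the exit code to set the service state. The text before `|` is what appears in the web interface, and the part after it is passed on as performance data.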

What about my Windows Servers?

We all know how clever network devices and Linux boxes are when it comes to giving out information on status, but Windows tends to be lacking in this area. To get around this there is a plugin set called NagiosPluginsNT, written in C#, which runs on the Windows server and is called through NRPE (Nagios Remote Plugin Executor): the check_nrpe plugin on the Nagios server asks the NRPE daemon on the Windows box to run a plugin and send back the result. Both of these can be downloaded from Nagios Exchange. The instructions below are in two parts:

1. On the Nagios server

  • Install the libssl-dev package (this is required to compile the plugins)

# apt-get install libssl-dev

  • Download the NRPE package (which contains the check_nrpe plugin) to your Nagios server from the Nagios download page and save it in your home folder
  • Unpack the installation package

# tar xzvf nrpe-version.tar.gz

  • Change to the newly created directory and run the configure script

# cd nrpe-version

# ./configure

  • Check that there are no errors and run the make command

# make all

  • Copy the freshly made plugin to the Nagios plugin directory

# cp ./src/check_nrpe /usr/local/nagios/libexec

  • Define the check by adding the following to the end of the checkcommands.cfg file:
...
# NRPE check
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }
  • Edit /usr/local/nagios/etc/conf.d/servicesTemplates.cfg to define the NRPE service checks
...
##### NRPE Service Checks ##### 
define service{ 
	name				nrpe-memory-service 
	use				generic-service 
	service_description	Memory Usage 
	is_volatile			0 
	contact_groups		windows-admins 
	check_command		check_nrpe!check_mem 
	register			0 
	} 

# CPU Utilisation Check
define service{ 
	name				nrpe-cpu-service 
	use				generic-service 
	service_description	CPU Utilisation 
	is_volatile			0 
	contact_groups		windows-admins 
	check_command		check_nrpe!check_cpu 
	register			0 
	}

# Free Disk Space (C:) Check 
define service{ 
	name				nrpe-diskC-service 
	use				generic-service 
	service_description	Disk Usage C: 
	is_volatile			0 
	contact_groups		windows-admins 
	check_command		check_nrpe!check_disk_c 
	register			0 
	} 

# Free Disk Space (D:) Check 
define service{ 
	name				nrpe-diskD-service 
	use				generic-service 
	service_description	Disk Usage D: 
	is_volatile			0 
	contact_groups		windows-admins 
	check_command		check_nrpe!check_disk_d 
	register			0 
	} 
  • Edit /usr/local/nagios/etc/conf.d/servicesServers.cfg to assign the service checks
...
# NRPE Checks
define service{
	host_name	WINDOW-BOX
	use		nrpe-memory-service
	}

define service{
	host_name	WINDOW-BOX
	use		nrpe-cpu-service
	}

define service{
	host_name	WINDOW-BOX
	use		nrpe-diskC-service
	}

define service{
	host_name	WINDOW-BOX
	use		nrpe-diskD-service
	} 

2. On the Windows server

  • Ensure that .NET version 2 is installed on the server
  • Download the NRPE_NT daemon from http://www.miwi-dv.com/nrpent/
  • Extract the bin folder to c:\nrpe_nt
  • Run the command to install NRPE_NT as a service

# c:\nrpe_nt\nrpe_nt -i

  • Download the latest version of the plugins from http://nagiospluginsnt.getproactivenow.com/download/releases/ - make sure you get the 'bin' version rather than the 'src' version
  • Extract the plugin package to c:\nrpe_nt\plugins
  • Edit c:\nrpe_nt\nrpe.cfg so that it includes the checks that you wish to perform
...
# Check disk space of C: and D: - warning at 90% full and critical at 95% full
command[check_disk_c]=C:\NRPE_NT\Plugins\diskspace_nrpe_nt.exe C: 90 95 
command[check_disk_d]=C:\NRPE_NT\Plugins\diskspace_nrpe_nt.exe D: 90 95 
# Check CPU utilisation – warning at 70% and critical at 85%
command[check_cpu]=C:\NRPE_NT\NagiosPluginsNT\check_cpu.exe -U % -w 70 -c 85 
# Check memory usage – warning at 70% and critical at 85%
command[check_mem]=C:\NRPE_NT\NagiosPluginsNT\check_mem.exe -U % -w 70 -c 85 
  • Restart the nrpe_nt service (Nagios Remote Plugin Executor for NT/W2K in Services management console)
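The names in square brackets in nrpe.cfg are what check_nrpe asks for with `-c`. This means every `check_nrpe!name` used in the service templates on the Nagios server must have a matching `command[name]=` line on the Windows server. Below is a small Python sketch that parses the `command[...]` lines and reports which requested names are missing. It assumes the simple `command[name]=command line` form shown above and ignores every other directive in the file.

```python
import re

# One 'command[name]=command line' entry; surrounding whitespace is ignored.
_COMMAND = re.compile(r"^\s*command\[(?P<name>[^\]]+)\]\s*=\s*(?P<cmd>.+?)\s*$")


def parse_nrpe_commands(text):
    """Return {name: command line} for each command[...] entry, skipping comments."""
    commands = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue
        m = _COMMAND.match(line)
        if m:
            commands[m.group("name")] = m.group("cmd")
    return commands


def missing_commands(nrpe_cfg_text, requested):
    """List requested command names that nrpe.cfg does not define."""
    defined = parse_nrpe_commands(nrpe_cfg_text)
    return [name for name in requested if name not in defined]
```

Feeding it the contents of c:\nrpe_nt\nrpe.cfg together with the names used on the Nagios server (check_mem, check_cpu, check_disk_c, check_disk_d) should return an empty list. Any name it does return will make the corresponding service fail when check_nrpe calls it.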