Thursday 3 November 2011

ABOUT NAGIOS


Introducing Nagios
This part shows you how to install Nagios and tie Ganglia back into it. We're going to add two features to Nagios that'll help your monitoring efforts in standard clusters, grids, clouds (or whatever your favorite buzzword is for scale-out computing). The two features are all about:
·         Monitoring network switches
·         Monitoring the resource manager
In this case, we'll be monitoring TORQUE. When we are finished, you'll have a framework to control the monitoring system of your entire data center.
Nagios, like Ganglia, is used heavily in HPC and other environments, but Nagios is more of an alerting mechanism than Ganglia (which is more focused on gathering and tracking metrics). Nagios previously only polled information from its target hosts, but has recently developed plug-ins that allow it to run agents on those hosts. Nagios has a built-in notification system.
Now let's install Nagios and set up a baseline monitoring system of an HPC Linux® cluster to address the three different monitoring perspectives:
·         The application person can see how full the queues are and see available nodes for running jobs.
·         The NOC can be alerted of system failures or see a shiny red error light on the Nagios Web interface. They also get notified via email if nodes go down or temperatures get too high.
·         The system engineer can graph data, report on cluster utilization, and make decisions on future hardware acquisitions.
check_openmanage is a plugin for Nagios that checks the hardware health of Dell PowerEdge and PowerVault servers. It uses the Dell OpenManage Server Administrator (OMSA) software to accomplish this task. check_openmanage can be used remotely with SNMP or locally with NRPE. The plugin checks the health of the storage subsystem, power supplies, memory modules, temperature probes, etc., and gives an alert if any of the components are faulty or operate outside normal parameters.
Changes: The --global option was added, which turns on checking of everything. If used with SNMP, the global system health status is also probed, to protect the user against bugs in the... plugin. If used with omreport, the overall chassis health is used. Support for SNMP version 3 was added. Checking of esmhealth was added, which checks the overall health of the ESM log, i.e. the fill grade. Alert log reporting was fixed to use the same format as for the ESM log. Output messages are now sorted by severity. Minor changes were made in how out-of-date controller firmware/driver is reported.
[Dec 22, 2008] Nagiosgraph
Nagiosgraph is an add-on for Nagios. It collects service perfdata in RRD format, and displays the resulting graphs via CGI. 
Nagios is a powerful, modular network monitoring system that can be used to monitor many network services like smtp, http and dns on remote hosts. It also has support for snmp to allow you to check things like processor loads on routers and servers. I couldn't begin to cover all of the things that nagios can do in this article, so I'll just cover the basics to get you up and running.
apt-get install nagios-text
First we need to define people that will be notified, and define how they should be notified. In the example below, I define two users, joe and paul. Joe is the network guru and cares about routers and switches. Paul is the systems guy, and he cares about servers. Both will be notified via email and by pager. Note that if you are going to monitor your email server, you will want to use another notification method besides email. If your email server is down, you can't send anybody an email to notify them! :) In that case you will want to use a pager server to send a text message to a phone or pager, or set up a second nagios monitor that uses a different mail server to send email.
Edit /etc/nagios/contacts.cfg and add the following users:
define contact{
    contact_name                    joe
    alias                           Joe Blow
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r
    service_notification_commands   notify-by-email,notify-by-epager
    host_notification_commands      host-notify-by-email,host-notify-by-epager
    email                           joe@yourdomain.com
    pager                           5555555@pager.yourdomain.com
    }

define contact{
    contact_name                    paul
    alias                           Paul Shiznit
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r
    service_notification_commands   notify-by-email,notify-by-epager
    host_notification_commands      host-notify-by-email,host-notify-by-epager
    email                           paul@yourdomain.com
    pager                           5556666@pager.yourdomain.com
    }
Now add the users to groups.
In /etc/nagios/contactgroups.cfg add the following:
define contactgroup{
    contactgroup_name   router_admin
    alias               Network Administrators
    members             joe
}

define contactgroup{
    contactgroup_name   server_admin
    alias               Systems Administrators
    members             paul
}
You can add multiple members to a contact group by listing the users as a comma-separated list.
Now to define some hosts to monitor. For my example, I define two machines, a mail server and a router.
Edit /etc/nagios/hosts.cfg and add:
define host{
    use                     generic-host
    host_name               gw1.yourdomain.com
    alias                   Gateway Router
    address                 10.0.0.1
    check_command           check-host-alive
    max_check_attempts      20
    notification_interval   240
    notification_period     24x7
    notification_options    d,u,r
    }

define host{
    use                     generic-host
    host_name               mail.yourdomain.com
    alias                   Mail Server
    address                 10.0.0.100
    check_command           check-host-alive
    max_check_attempts      20
    notification_interval   240
    notification_period     24x7
    notification_options    d,u,r
    }
Now we add the hosts to groups. I define groups called 'routers' and 'servers' and add the router and mail server respectively.
Edit /etc/nagios/hostgroups.cfg
define hostgroup{
    hostgroup_name  routers
    alias           Routers
    contact_groups  router_admin
    members         gw1.yourdomain.com
    }

define hostgroup{
    hostgroup_name  servers
    alias           Servers
    contact_groups  server_admin
    members         mail.yourdomain.com
    }
Again, for multiple members, just use a comma-separated list of hosts.
Next, define services to monitor on each of the hosts. Nagios has many built-in plugins for monitoring. On a Debian sarge system, they are stored in /usr/lib/nagios/plugins. Here we want to monitor the smtp service on the mail server, and do ping checks on the router.
Edit /etc/nagios/services.cfg
define service{
    use                     generic-service
    host_name               mail.yourdomain.com
    service_description     SMTP
    is_volatile             0
    check_period            24x7
    max_check_attempts      3
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          server_admin
    notification_interval   240
    notification_period     24x7
    notification_options    w,u,c,r
    check_command           check_smtp
    }

define service{
    use                     generic-service
    host_name               gw1.yourdomain.com
    service_description     PING
    is_volatile             0
    check_period            24x7
    max_check_attempts      3
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          router_admin
    notification_interval   240
    notification_period     24x7
    notification_options    w,u,c,r
    check_command           check_ping!100.0,20%!500.0,60%
    }
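The check_command value in the PING service packs the command name and its arguments into one string, separated by exclamation marks. The Python sketch below only illustrates how that string breaks apart. It assumes the stock check_ping command definition, where the first argument ($ARG1$) is the warning threshold and the second ($ARG2$) is the critical threshold, each given as round-trip time in milliseconds and packet loss in percent. The function names are made up for this illustration and are not part of Nagios.

```python
def parse_check_command(value):
    """Split a check_command value into the command name and its arguments."""
    name, *args = value.split("!")
    return name, args


def parse_ping_threshold(arg):
    """Parse an '<rta>,<pl>%' threshold into (rta_ms, packet_loss_percent)."""
    rta, loss = arg.split(",")
    return float(rta), float(loss.rstrip("%"))


name, args = parse_check_command("check_ping!100.0,20%!500.0,60%")
warning = parse_ping_threshold(args[0])   # assumed to reach the plugin as $ARG1$
critical = parse_ping_threshold(args[1])  # assumed to reach the plugin as $ARG2$

print(name, warning, critical)
# prints: check_ping (100.0, 20.0) (500.0, 60.0)
```

Read this way, the PING service warns at 100 ms round-trip time or 20% packet loss, and goes critical at 500 ms or 60% loss. A command without arguments, such as check_smtp, simply yields an empty argument list.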
And that's it. To test your configurations, you can run
nagios -v /etc/nagios/nagios.cfg
If all is well we can restart nagios and move on to the apache side to get a visual view of the monitor.
/etc/init.d/nagios restart
Assuming you have a working apache install, you can add the apache.conf file included in the nagios package to set up the nagios cgi administration interface. The web interface is not required to run nagios, but it is definitely worth setting it up. The simplest way to get it up and running is to copy the supplied conf file over to our apache installation. On my system, I'm running apache2. Systems running apache 1.3.xx will have slightly different setups.
cp /etc/nagios/apache.conf /etc/apache2/sites-enabled/nagios
Of course you may want to set it up as a virtual server, but I leave that as an exercise for the reader. Now you will want to set up a user who is allowed to view the cgi interface. By default, nagios grants full administrative access to the nagiosadmin user. Nagios uses apache htpasswd style authentication, so here we add the user nagiosadmin with password mypassword to the default nagios htpasswd file.
htpasswd2 -nb nagiosadmin mypassword >> /etc/nagios/htpasswd.users
You should now be able to restart apache and log on to
http://your.nagios.server/nagios
Nagios is a very powerful tool for monitoring networks. I've only touched on the basics here, but it should be enough to get you up and running. Hopefully, once you do, you'll start experimenting with all the cool features and plugins that are available. The documentation included in the cgi interface is very detailed and helpful.
 
About: Nagstamon is a Nagios status monitor with a UI that resides in the GNOME systray or on the Windows desktop. It informs you in realtime about the status of your Nagios monitored network.
Changes: This release fixes a problem with passwords containing special characters, and an issue where it omitted showing failed services on hosts in scheduled downtime.
[Jun 25, 2008] check_oracle_health
About: check_oracle_health is a plugin for the Nagios monitoring software that allows you to monitor various metrics of an Oracle database. It includes connection time, SGA data buffer hit ratio, SGA library cache hit ratio, SGA dictionary cache hit ratio, SGA shared pool free, PGA in memory sort ratio, tablespace usage, tablespace fragmentation, tablespace I/O balance, invalid objects, and many more.
Release focus: Major feature enhancements
Changes: The tablespace-usage mode now takes into account when tablespaces use autoextents. The data-buffer/library/dictionary-cache-hitratio are now more accurate. Sqlplus can now be used instead of DBD::Oracle.
Configuring Nagios
In the main config file, make sure that the command_file directive is set and that it works. See http://nagios.sourceforge.net/docs/2_0/configmain.html#command_file for details.
Below is a sample extract from nagios.cfg:
command_file=/var/run/nagios/nagios.cmd
The /var/run/nagios directory is owned by the user nagios runs as. The nagios.cmd is a named pipe on which Nagios accepts external input.
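To make the role of the named pipe more concrete, here is a minimal Python sketch of what a passive check result looks like when it is written to the command file. It assumes the standard Nagios external command PROCESS_SERVICE_CHECK_RESULT and the command_file path from the extract above. The helper names passive_service_result and submit are hypothetical, and the host and service names are placeholders.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


def submit(line, command_file="/var/run/nagios/nagios.cmd"):
    """Write one command line to the Nagios command pipe.

    Opening the pipe blocks until Nagios has it open for reading, so only
    call this on a host where Nagios is running.
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)


# Only format the line here; submit() is left for a real Nagios host.
print(passive_service_result("foo.example.com", "test", 0, "OK", now=1159868622), end="")
# prints: [1159868622] PROCESS_SERVICE_CHECK_RESULT;foo.example.com;test;0;OK
```

NSCA, configured in the next sections, feeds results it receives over the network into Nagios through this same command file.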
Configuring NSCA, server side
NSCA is run through (x)inetd. Using inetd, the line below enables NSCA listening on port 5667:
5667            stream  tcp     nowait  nagios  /usr/sbin/tcpd  /usr/sbin/nsca -c /etc/nsca.cfg --inetd
Using xinetd, the configuration below enables NSCA listening on port 5667, allowing connections only from the local host:
# description: NSCA (Nagios Service Check Acceptor)
service nsca
{
 flags           = REUSE
 type            = UNLISTED
 port            = 5667
 socket_type     = stream
 wait            = no

 server          = /usr/sbin/nsca
 server_args     = -c /etc/nagios/nsca.cfg --inetd
 user            = nagios
 group           = nagios

 log_on_failure  += USERID

 only_from       = 127.0.0.1
}
The file /etc/nsca.cfg defines how NSCA behaves. Check in particular the nsca_user and command_file directives; these should correspond to the file permissions and the location of the named pipe described in nagios.cfg.
nsca_user=nagios
command_file=/var/run/nagios/nagios.cmd
Configuring NSCA, client side
The NSCA client is a binary that submits to an NSCA server whatever it reads on standard input. Its behaviour is controlled by the file /etc/send_nsca.cfg, which mainly controls encryption.
You should now be able to test the communication between the NSCA client and the NSCA server, and consequently whether Nagios picks up the message. NSCA requires a defined format for messages. For service checks, it's like this: <host_name>[tab]<svc_description>[tab]<return_code>[tab]<plugin_output>[newline]
The following shows how to test NSCA:
$ /usr/sbin/send_nsca -H localhost -c /etc/send_nsca.cfg
foo.example.com  test   0       0
1 data packet(s) sent to host successfully.
This caused the following to appear in /var/log/nagios/nagios.log:
[1159868622] Warning:  Message queue contained results for service 'test' on host 'foo.example.com'.  The service could not be found!
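The tab-separated message format described above is easy to get wrong by hand, since tabs and spaces look alike in a terminal. The Python sketch below builds one service check line and shows one way to pipe it into send_nsca. The function names are hypothetical. The send_nsca path and options are the ones used in the test above and may differ on your system.

```python
import subprocess


def nsca_service_line(host_name, svc_description, return_code, plugin_output):
    """Build '<host_name>\\t<svc_description>\\t<return_code>\\t<plugin_output>\\n'."""
    fields = [host_name, svc_description, str(return_code), plugin_output]
    for field in fields:
        # A stray tab or newline inside a field would shift the columns.
        if "\t" in field or "\n" in field:
            raise ValueError(f"field contains a tab or newline: {field!r}")
    return "\t".join(fields) + "\n"


def send(line, server="localhost", config="/etc/send_nsca.cfg"):
    """Pipe one line into send_nsca on standard input (needs NSCA installed)."""
    subprocess.run(
        ["/usr/sbin/send_nsca", "-H", server, "-c", config],
        input=line,
        text=True,
        check=True,
    )


# Show the exact bytes that would be sent; send() is not called here.
print(repr(nsca_service_line("foo.example.com", "test", 0, "0")))
# prints: 'foo.example.com\ttest\t0\t0\n'
```

As the log excerpt above shows, Nagios rejects the result unless a service with that description is defined for that host.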
Messages are sent by munin-limits based on the state of a monitored data source: OK, Warning and Critical. Munin does not currently support an Unknown state (this will be fixed in the future; see Ticket 29 for more information).
Configuring munin.conf
Munin uses the above-mentioned send_nsca binary to send messages to Nagios. In /etc/munin/munin.conf, enter this:
contacts nagios
contact.nagios.command /usr/bin/send_nsca -H your.nagios-host.here -c /etc/send_nsca.cfg
Be aware that the -H switch to send_nsca appeared sometime after send_nsca version 2.1. Always check send_nsca --help!
Configuring Munin plugins
Lots of Munin plugins have (hopefully reasonable) values for Warning and Critical levels. To set or override these, you can change the values in munin.conf.
Configuring Nagios services
Now Nagios needs to recognize the messages from Munin as messages about services it monitors. To accomplish this, every message Munin sends to Nagios requires a matching (passive) service defined or Nagios will ignore the message (but it will log that something tried).
A passive service is defined through these directives in the proper Nagios configuration file:
active_checks_enabled           0
passive_checks_enabled          1
A working solution is to create a template for passive services, like the one below:
define service {
        name                            passive-service
        active_checks_enabled           0
        passive_checks_enabled          1
        parallelize_check               1
        notifications_enabled           1
        event_handler_enabled           1
        register                        0
        is_volatile                     1
}
When the template is registered, each Munin plugin should be registered as per below:
define service {
        use                             passive-service
        host_name                       foo
        service_description             bar
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           3
        retry_check_interval            1
        contact_groups                  linux-admins
        notification_interval           120
        notification_period             24x7
        notification_options            w,u,c,r
        check_command                   check_dummy!0
}
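The way a service picks up directives from the passive-service template can be pictured as a dictionary merge. The Python sketch below is only a simplified model of that inheritance, not how Nagios implements it: it handles a single template level, treats name and register as template bookkeeping that is left out of the result, and lets directives set on the service win over those from the template. The function resolve_service is hypothetical.

```python
def resolve_service(service, templates):
    """Return the effective directives of a service that uses one template."""
    template = templates[service["use"]]
    # 'name' and 'register' describe the template itself; this sketch
    # leaves them out of the effective service.
    inherited = {k: v for k, v in template.items() if k not in ("name", "register")}
    # Directives set on the service override those from the template.
    effective = {**inherited, **service}
    effective.pop("use")
    return effective


templates = {
    "passive-service": {
        "name": "passive-service",
        "active_checks_enabled": "0",
        "passive_checks_enabled": "1",
        "register": "0",
        "is_volatile": "1",
    }
}

service = {
    "use": "passive-service",
    "host_name": "foo",
    "service_description": "bar",
    "check_command": "check_dummy!0",
}

for key, value in resolve_service(service, templates).items():
    print(f"{key:28}{value}")
```

The resolved service has active checks disabled and passive checks enabled without repeating those directives in every Munin-related service definition.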
Notes
·         host_name is either the FQDN of the host_name registered to the Nagios plugin, or the host alias corresponding to Munin's