Introducing Nagios
This part shows you how to install Nagios and tie Ganglia back into it. We're going to add two features to Nagios that'll help your monitoring efforts in standard clusters, grids, clouds (or whatever your favorite buzzword is for scale-out computing). The two features are all about:
· Monitoring network switches
· Monitoring the resource manager
In this case, we'll be monitoring TORQUE. When we are finished, you'll have a framework to control the monitoring system of your entire data center.
Nagios, like Ganglia, is used heavily in HPC and other environments, but Nagios is more of an alerting mechanism that Ganglia (which is more focused on gathering and tracking metrics). Nagios previously only polled information from its target hosts, but has recently developed plug-ins that allow it to run agents on those hosts. Nagios has a built-in notification system.
Now let's install Nagios and set up a baseline monitoring system of an HPC Linux® cluster to address the three different monitoring perspectives:
· The application person can see how full the queues are and see available nodes for running jobs.
· The NOC can be alerted of system failures or see a shiny red error light on the Nagios Web interface. They also get notified via email if nodes go down or temperatures get too high.
· The system engineer can graph data, report on cluster utilization, and make decisions on future hardware acquisitions.
[Apr 21, 2009] check_openmanage freshmeat.net
check_openmanage is a plugin for Nagios that checks the hardware health of Dell PowerEdge and PowerVault servers. It uses the Dell OpenManage Server Administrator (OMSA) software to accomplish this task. check_openmanage can be used remotely with SNMP or locally with NRPE. The plugin checks the health of the storage subsystem, power supplies, memory modules, temperature probes, etc., and gives an alert if any of the components are faulty or operate outside normal parameters.
Changes: The --global option was added, which turns on checking of everything. If used with SNMP, the global system health status is also probed, to protect the user against bugs in the... plugin. If used with omreport, the overall chassis health is used. Support for SNMP version 3 was added. Checking of esmhealth was added, which checks the overall health of the ESM log, i.e. the fill grade. Alert log reporting was fixed to use the same format as for the ESM log. Output messages are now sorted by severity. Minor changes were made in how out-of-date controller firmware/driver is reported
[Dec 22, 2008] Nagiosgraph
Nagiosgraph is an add-on for Nagios. It collects service perfdata in RRD format, and displays the resulting graphs via CGI.
Nagios is a powerful, modular network monitoring system that can be used to monitor many network services like smtp, http and dns on remote hosts. It also has support for snmp to allow you to check things like processor loads on routers and servers. I couldn't begin to cover all of the things that nagios can do in this article, so I'll just cover the basics to get you up and running.
apt-get install nagios-text
First we need to define people that will be notified, and define how they should be notified. In the example below, I define two users, joe and paul. Joe is the network guru and cares about routers and switches. Paul is the systems guy, and he cares about servers. Both will be notified via email and by pager. Note that if you are going to monitor your email server, you will want to use another notification method besides email. If your email server is down, you can't send anybody an email to notify them! :) In that case you will want to use a pager server to send a text message to a phone or pager, or set up a second nagios monitor that uses a different mail server to send email.
Edit /etc/nagios/contacts.cfg and add the following users:
define contact{
contact_name joe
alias Joe Blow
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,u,r
service_notification_commands notify-by-email,notify-by-pager
host_notification_commands host-notify-by-email,host-notify-by-epager
email joe@yourdomain.com
pager 5555555@pager.yourdomain.com
}
define contact{
contact_name paul
alias Paul Shiznit
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,u,r
service_notification_commands notify-by-email,notify-by-epager
host_notification_commands host-notify-by-email,host-notify-by-epager
email paul@yourdomain.com
pager 5556666@pager.yourdomain.com
}
Now add the users to groups.
In /etc/nagios/contactgroups.cfg add the following:
In /etc/nagios/contactgroups.cfg add the following:
define contactgroup{
contactgroup_name router_admin
alias Network Administrators
members joe
}
define contactgroup{
contactgroup_name server_admin
alias Systems Administrators
members paul
}
You can add multiple members to a contact group by listing comma separated users.
Now to define some hosts to monitor. For my example, I define two machines, a mail server and a router.
Edit /etc/nagios/hosts.cfg and add:
define host{
use generic-host
host_name gw1.yourdomain.com
alias Gateway Router
address 10.0.0.1
check_command check-host-alive
max_check_attempts 20
notification_interval 240
notification_period 24x7
notification_options d,u,r
}
define host{
use generic-host
host_name mail.yourdomain.com
alias Mail Server
address 10.0.0.100
check_command check-host-alive
max_check_attempts 20
notification_interval 240
notification_period 24x7
notification_options d,u,r
}
Now we add the hosts to groups. I define groups called 'routers' and 'servers' and add the router and mail server respectively.
Edit /etc/nagios/hostgroups.cfg
define hostgroup{
hostgroup_name routers
alias Routers
contact_groups router_admin
members gw1.yourdomain.com
}
define hostgroup{
hostgroup_name servers
alias Servers
contact_groups server_admin
members mail.yourdomain.com
}
Again, for multiple members, just use a comma separated list of hosts.
Next define services to monitor on each of the hosts. Nagios has many built-in plugins for monitoring. On a debian sarge system, they are stored in /usr/lib/nagios/plugins. Here we want to monitor the smtp service on the mail server, and do ping checks on the router.
Edit /etc/nagios/services.cfg
define service{
use generic-service
host_name mail.yourdomain.com
service_description SMTP
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups server_admin
notification_interval 240
notification_period 24x7
notification_options w,u,c,r
check_command check_smtp
}
define service{
use generic-service
host_name gw1.yourdomain.com
service_description PING
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups router_admin
notification_interval 240
notification_period 24x7
notification_options w,u,c,r
check_command check_ping!100.0,20%!500.0,60%
}
And that's it. To test your configurations, you can run
nagios -v /etc/nagios/nagios.cfg
If all is well we can restart nagios and move on to the apache side to get a visual view of the monitor.
/etc/init.d/nagios restart
Assuming you have a working apache install, you can add the apache.conf file included in the nagios package to set up the nagios cgi administration interface. The web interface is not required to run nagios, but it is definitely worth setting it up. The simplest way to get it up and running is to copy the supplied conf file over to our apache installation. On my system, I'm running apache2. Systems running apache 1.3.xx will have slightly different setups.
cp /etc/nagios/apache.conf /etc/apache2/sites-enabled/nagios
Of course you may want to set it up as a virtual server, but I leave that as an exercise for the reader. Now you will want to set up an allowed user to view the cgi interface. By default, nagios issues full administrative access to the nagiosadmin user. Nagios uses apache htpasswd style authentication. So here we add a user and password to the default nagios htpasswd file. Here we add the user nagiosadmin with password mypassword to the nagios htpasswd file.
htpasswd2 -nb nagiosadmin mypassword >> /etc/nagios/htpasswd.users
You should now be able to restart apache and logon to
http://your.nagios.server/nagios
Nagios is a very powerful tool for monitoring networks. I've only touched on the basics here, but it should be enough to get you up and running. Hopefully, once you do, you'll start experimenting with all the cool features and plugins that are available. The documentation included in the cgi interface is very detailed and helpful.
About: Nagstamon is a Nagios status monitor with a UI that resides in the GNOME systray or on the Windows desktop. It informs you in realtime about the status of your Nagios monitored network.
Changes: This release fixes a problem with passwords containing special characters, and an issue where it omitted showing failed services on hosts in scheduled downtime.
[Jun 25, 2008] check_oracle_health
About: check_oracle_health is a plugin for the Nagios monitoring software that allows you to monitor various metrics of an Oracle database. It includes connection time, SGA data buffer hit ratio, SGA library cache hit ratio, SGA dictionary cache hit ratio, SGA shared pool free, PGA in memory sort ratio, tablespace usage, tablespace fragmentation, tablespace I/O balance, invalid objects, and many more.
Release focus: Major feature enhancements
Changes: The tablespace-usage mode now takes into account when tablespaces use autoextents. The data-buffer/library/dictionary-cache-hitratio are now more accurate. Sqlplus can now be used instead of DBD::Oracle.
Configuring Nagios ¶
In the main config file, make sure that the command_file directive is set and that it works. See http://nagios.sourceforge.net/docs/2_0/configmain.html#command_file for details.
Below is a sample extract from nagios.cfg:
command_file=/var/run/nagios/nagios.cmd
The /var/run/nagios directory is owned by the user nagios runs as. The nagios.cmd is a named pipe on which Nagios accepts external input.
Configuring NSCA, server side ¶
NSCA is run through (x)inetd. Using inetd, the below line enables NSCA listening on port 5667:
5667 stream tcp nowait nagios /usr/sbin/tcpd /usr/sbin/nsca -c /etc/nsca.cfg --inetd
Using xinetd, the blow line enables NSCA listening on port 5667, allowing connections only from the local host:
# description: NSCA (Nagios Service Check Acceptor)
service nsca
{
flags = REUSE
type = UNLISTED
port = 5667
socket_type = stream
wait = no
server = /usr/sbin/nsca
server_args = -c /etc/nagios/nsca.cfg --inetd
user = nagios
group = nagios
log_on_failure += USERID
only_from = 127.0.0.1
}
The file /etc/nsca.cfg defines how NSCA behaves. Check in particular the nsca_user and command_file directives, these should correspond to the file permissions and the location of the named pipe described in nagios.cfg.
nsca_user=nagios
command_file=/var/run/nagios/nagios.cmd
Configuring NSCA, client side ¶
The NSCA client is a binary that submits to an NSCA server whatever it received as arguments. Its behaviour is controlled by the file /etc/send_nsca.cfg, which mainly controls encryption.
You should now be able to test the communication between the NSCA client and the NSCA server, and consequently whether Nagios picks up the message. NSCA requires a defined format for messages. For service checks, it's like this: <host_name>[tab]<svc_description>[tab]<return_code>[tab]<plugin_output>[newline]
Below is shown how to test NSCA.
$ /usr/sbin/send_nsca -H localhost -c /etc/send_nsca.cfg
foo.example.com test 0 0
1 data packet(s) sent to host successfully.
This caused the following to appear in /var/log/nagios/nagios.log:
[1159868622] Warning: Message queue contained results for service 'test' on host 'foo.example.com'. The service could not be found!
Messages are sent by munin-limits based on the state of a monitored data source: OK, Warning and Critical. Munin does not currently support a Unknown state (This will be fixed in the future, see Ticket 29 for more information).
Configuring munin.conf ¶
Nagios uses the above mentioned send_nsca binary to send messages to Nagios. In /etc/munin/munin.conf, enter this:
contacts nagios
contact.nagios.command /usr/bin/send_nsca -H your.nagios-host.here -c /etc/send_nsca.cfg
! | | Be aware that the -H switch to send_nsca appeared sometime after send_nsca version 2.1. Always check send_nsca --help! |
Configuring Munin plugins ¶
Lots of Munin plugins have (hopefully reasonable) values for Warning and Critical levels. To set or override these, you can change the values in munin.conf.
Configuring Nagios services ¶
Now Nagios needs to recognize the messages from Munin as messages about services it monitors. To accomplish this, every message Munin sends to Nagios requires a matching (passive) service defined or Nagios will ignore the message (but it will log that something tried).
A passive service is defined through these directives in the proper Nagios configuration file:
active_checks_enabled 0
passive_checks_enabled 1
A working solution is to create a template for passive services, like the one below:
define service {
name passive-service
active_checks_enabled 0
passive_checks_enabled 1
parallelize_check 1
notifications_enabled 1
event_handler_enabled 1
register 0
is_volatile 1
}
When the template is registered, each Munin plugin should be registered as per below:
define service {
use passive-service
host_name foo
service_description bar
check_period 24x7
max_check_attempts 3
normal_check_interval 3
retry_check_interval 1
contact_groups linux-admins
notification_interval 120
notification_period 24x7
notification_options w,u,c,r
check_command check_dummy!0
}
Notes ¶
· host_name is either the FQDN of the host_name registered to the Nagios plugin, or the host alias corresponding to Munin's
No comments:
Post a Comment