Monitoring Servers With OpenNMS
OpenNMS is an open source network management platform that uses, among other protocols, the Simple Network Management Protocol (SNMP) to monitor servers. It can be configured to send email notifications when, for example, disk usage is too high or a server is thrashing.
The following instructions show how to set up monitoring and notification for a disk that is 90% full.
Installing OpenNMS
Steps to take during install/configuration:
- follow install instructions
- modify Service discovery and polling configuration (see below)
- change org.opennms.core.utils.fromAddress in /etc/opennms/javamail-configuration.properties to the email address of your choice. This is the address that OpenNMS will use as the sender of notification emails when a monitored server is having technical difficulties.
- turn on notifications (Admin -> Notification Status: On)
Adding servers to OpenNMS to monitor
- Click "Add Node" on the top navigation
- Under "Basic Attributes"
- for "Provisioning Group", choose "Cloud"
- enter the IP address (the public/elastic IP address)
- In "Node Label", enter the public DNS host name
- leave all other fields as-is and click "Provision"
- add the new node to the appropriate surveillance categories in "Admin", "Manage Surveillance Categories"
Stop monitoring a server
- Click "Admin", then "Manage Provisioning Groups"
- To the right of "Requisition (Provisioning Group)" click "Edit"
- Click on the trash can icon to the left of the Node you wish to delete
- Click "Admin" (in top nav) again, then "Delete Nodes"
- Check both "Delete?" and "Data?" for the node you wish to remove
- Click "Delete Nodes"
Adding or removing email addresses that receive outage notifications
- Log in as the admin user
- Admin -> Configure Notifications -> Configure Destination Paths
- select Email-Admin and click Edit button below
- click Edit button on right
- click Add address on right (alternatively, select one and click "remove")
- click "Next >>>" on left
- click "Next >>>" again
- click Finish
If you make a mistake at any time before clicking Finish, just navigate to a different page and your changes will be discarded.
Also note the "Initial delay" setting: if service is restored before the initial delay has elapsed, no notification is sent.
Scheduled Outages
Scheduling downtime will avoid false outage emails.
- Log in as the admin user
- Admin -> Scheduled Outages
- Pick a name that describes why the outage will occur
- Specify Nodes, Type, and Time, then check "All notifications"
- all times are in UTC
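Because the outage form expects UTC, a local maintenance window has to be converted before it is entered. The following is a minimal sketch of that conversion; the function name `to_utc` and the example offset of UTC-5 are assumptions made for illustration and are not part of OpenNMS.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_wall_time: datetime, utc_offset_hours: float) -> datetime:
    """Convert a naive local wall-clock time to an aware UTC datetime.

    utc_offset_hours is the local zone's offset from UTC (for example -5).
    """
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return local_wall_time.replace(tzinfo=local_zone).astimezone(timezone.utc)


# Hypothetical maintenance window starting at 22:00 local time in a UTC-5 zone.
start = to_utc(datetime(2024, 3, 2, 22, 0), -5)
print(start.strftime("%Y-%m-%d %H:%M"))
```

Note that the converted time can fall on a different calendar date than the local one, which is easy to miss when filling in the form.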
Service discovery and polling configuration
Using this guide you can map your services on OpenNMS. For example, let's say that you run a service on your server named My Service and you would like OpenNMS to know about it so that it can detect when the service goes down. You would modify your OpenNMS configuration files like so:
<!-- added to capsd-configuration.xml -->
<protocol-plugin protocol="My Service" class-name="org.opennms.netmgt.capsd.plugins.HttpPlugin" scan="on">
  <property key="port" value="80" />
  <property key="timeout" value="3000" />
  <property key="retry" value="2" />
  <property key="url" value="/my-url/" />
  <property key="max-ret-code" value="202" />
</protocol-plugin>

<!-- added to poller-configuration.xml -->
<service name="My Service" interval="300000" user-defined="false" status="on">
  <parameter key="port" value="80" />
  <parameter key="timeout" value="10000" />
  <parameter key="retry" value="3" />
  <parameter key="url" value="/my-url/" />
  <parameter key="ds-name" value="myService" />
</service>
<!-- the monitor's service attribute must match the service name above -->
<monitor service="My Service" class-name="org.opennms.netmgt.poller.monitors.HttpMonitor" />
If capsd-configuration is changed (for example, to add properties to a protocol plugin to make sure it is only discovered on hosts actually running that service), it may be necessary to manually delete corresponding rows in ifservices (those that map the service to a particular interface).
Instead of telling OpenNMS "this service runs on this IP address", you generally tell it "here is how you recognize a service, and here is a list of IP addresses... figure it out!" While it is possible to be more explicit about which services run where, a proper service discovery and polling configuration requires less manual maintenance.
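The "here is how you recognize a service, here is a list of addresses" idea can be sketched in a few lines. This is only an illustration of the concept, not how OpenNMS implements discovery: the `discover_service` function and the `Probe` type are hypothetical, and treating `max-ret-code` as an upper bound on the acceptable HTTP status code is an assumption.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical probe type: returns the HTTP status code seen at an address,
# or None when nothing answered. A real probe would issue the GET request.
Probe = Callable[[str], Optional[int]]


def discover_service(addresses: Iterable[str], probe: Probe,
                     max_ret_code: int = 202) -> List[str]:
    """Return the addresses on which the service was recognized."""
    found = []
    for address in addresses:
        status = probe(address)
        # Assumption: a response counts as a match when its status code
        # does not exceed max_ret_code.
        if status is not None and status <= max_ret_code:
            found.append(address)
    return found


# Canned responses stand in for real HTTP requests in this illustration.
canned = {"192.0.2.10": 200, "192.0.2.11": 404}
print(discover_service(["192.0.2.10", "192.0.2.11", "192.0.2.12"], canned.get))
```

Passing the probe in as a function keeps the recognition rule separate from the list of addresses, which mirrors the split between the plugin definition and the provisioned nodes.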
Path outages
OpenNMS supports a feature designed to suppress outage notifications if, for instance, a WAN link between the OpenNMS server and remote (monitored) servers goes down.
Installing SNMP
Now that you have OpenNMS installed, you will need to install SNMP on all the computers that you will be monitoring.
Install Net-SNMP
Net-SNMP is a suite of SNMP applications that you will use to configure SNMP on the computer that you want to monitor. If you are on a Debian-based system, you can install Net-SNMP by opening a terminal and typing:
apt-get install snmp libsnmp-dev snmpd
Set up your snmpd.conf file
snmpd.conf holds the configuration settings for SNMP. You can either edit /etc/snmp/snmpd.conf directly or use the snmpconf configuration utility. Detailed information about snmpd.conf can be found in its man page (man snmpd.conf).
An example of a barebones snmpd.conf is:
rocommunity public default .1
disk /
The first line sets the read-only community name to public and gives it read access to the whole SNMP tree. This means that anyone who can reach this computer and uses the public community will be able to read any SNMP value on it.
The second line tells SNMP that / is the root directory of a disk that we would like to monitor.
Now that you have configured SNMP, type the following on the command line and you should see it print out a series of values describing /:
snmpwalk -On -v 2c -c public localhost 1.3.6.1.4.1.2021.9
Setting up SNMP for outside access
In order for OpenNMS to be able to read SNMP values from your server, you will have to perform one last step. Edit /etc/default/snmpd to say:
SNMPDRUN=yes
SNMPDOPTS='-Lsd -Lf /dev/null -u snmp -I -smux -p /var/run/snmpd.pid'
All done! Restart snmpd so the change takes effect, then test the connection from the computer that is running OpenNMS by typing:
snmpwalk -On -v 2c -c public SERVER system
replacing SERVER with the IP address or hostname of the server where SNMP was just installed.
Setting up disk usage monitoring
In this section, we will discuss how to set up disk usage monitoring in OpenNMS.
While the rest of this guide dives into hand-editing the XML configuration files, most or all of the following can now be done directly via the OpenNMS Web UI. You may want to try that first.
Becoming familiar with the OpenNMS configuration files
These configuration files are in the /etc/opennms/ directory and are important for understanding disk monitoring:
- datacollection-config.xml: This file is where OIDs are mapped to names that OpenNMS can understand. If you search the file for "ns-dsk" you will find where it maps disk OIDs to aliases.
- thresholds.xml: This file is where thresholds are set for events. For example, let's say you want to know whenever disk usage rises above 90%. This is where you would set that threshold of 90%. In this file you will also set a second threshold, called the rearm threshold. Once the disk usage hits 90%, an event will be created to notify you. As long as it stays above 90%, no new events will be created because they would be redundant. The threshold is "rearmed" (i.e. it can be triggered again) once the disk usage goes below the rearm threshold, which might be something like 75%.
- eventconf.xml: As mentioned earlier, events are created once thresholds are exceeded. The event configuration file points to other event files. Each event file holds different kinds of events, and you will want to make one to hold your custom events.
- notifications.xml: This file contains the configuration for sending out email notifications for events. Events will simply show up in the OpenNMS web interface and never send out email notifications unless you configure this file as well.
- notifd-configuration.xml: This file contains the configuration for automatic notification acknowledgments. You acknowledge events via the web interface, and by doing so you clear them from the queue of events. Most events have a matching event (e.g. if disk usage goes above the threshold, an event will be created warning you that the disk usage is too high; when it goes back under the rearm threshold, a new event will be created to say that the disk recovered). With automatic acknowledgments set up, the first event is acknowledged automatically when the second event (i.e. the rearm) arrives. Setting up notifications like this requires less code than setting them up without automatic acknowledgments and so may be preferable. We'll give both ways to set up notifications in the following sections.
Modifying the thresholds.xml file
There should already be an entry in your thresholds file for disk usage. It should look something like this:
<threshold type="high" ds-type="dskIndex" value="90.0" rearm="75.0" trigger="2" ds-label="ns-dskPath" ds-name="ns-dskPercent"/>
This means that if ns-dskPercent exceeds 90 on two consecutive polls (trigger="2"), an event will be created. If it then falls back to 75 or lower, the threshold will be rearmed.
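The interaction between value, rearm, and trigger is easier to see as a small state machine. The sketch below is an illustration of the behavior described above, not OpenNMS's own code: the class name `HighThreshold` is hypothetical, and the exact boundary handling (strictly above value to trigger, at or below rearm to rearm) is an assumption.

```python
from typing import Optional


class HighThreshold:
    """Illustrative model of a type="high" threshold with a rearm value."""

    def __init__(self, value: float, rearm: float, trigger: int) -> None:
        self.value = value
        self.rearm = rearm
        self.trigger = trigger
        self.armed = True
        self.exceeded_count = 0

    def evaluate(self, sample: float) -> Optional[str]:
        """Return "triggered", "rearmed", or None for one polled sample."""
        if self.armed:
            if sample > self.value:
                self.exceeded_count += 1
                # Fire only after `trigger` consecutive samples above value.
                if self.exceeded_count >= self.trigger:
                    self.armed = False
                    self.exceeded_count = 0
                    return "triggered"
            else:
                self.exceeded_count = 0
        elif sample <= self.rearm:
            # Back at or below the rearm value: the threshold can fire again.
            self.armed = True
            return "rearmed"
        return None


threshold = HighThreshold(value=90.0, rearm=75.0, trigger=2)
for percent in [80, 95, 96, 97, 80, 74]:
    print(percent, threshold.evaluate(percent))
```

In this run the sample of 80 after the trigger produces nothing, because 80 is below the high value but still above the rearm value; only the later sample of 74 rearms the threshold.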
You will want to add your own custom threshold and rearm UEIs (Unique Event Identifiers). Without custom UEIs, disk usage will be handled like all the other events that have thresholds of type high. This means that if you want notification emails whenever there is high disk usage, you will have to turn on email notifications for all events with thresholds of type high. You probably do not want notifications for all high events, so let's go ahead and create custom threshold and rearm UEIs:
<threshold type="high" ds-type="dskIndex" value="90.0" rearm="75.0" trigger="2" ds-label="ns-dskPath" ds-name="ns-dskPercent" triggeredUEI="uei.opennms.org/YOURCOMPANY/ns-dskPercent-high" rearmedUEI="uei.opennms.org/YOURCOMPANY/ns-dskPercent-rearm" />
Obviously, you will want to replace YOURCOMPANY with your company's name.
Creating your custom event file
Next we need to create a custom event file containing events for the custom UEIs we just created. An advantage of custom UEIs is that we can write customized descriptions. The default OpenNMS descriptions are generic, so the event message shown in the web interface can be hard to understand at first.
Create a file called YOURCOMPANY.events.xml in the events directory of your OpenNMS installation directory, where YOURCOMPANY is your company's name:
<?xml version="1.0" encoding="UTF-8"?>
<events xmlns="http://xmlns.opennms.org/xsd/eventconf">
  <event>
    <uei xmlns="">uei.opennms.org/YOURCOMPANY/ns-dskPercent-high</uei>
    <event-label xmlns="">Disk Usage Too High</event-label>
    <descr xmlns="">The disk usage on %parm[label]% on interface %interface% is too high. It is %parm[value]% percent full, which exceeds the high threshold of %parm[threshold]% percent.</descr>
    <logmsg dest="logndisplay">The disk usage on %parm[label]% on interface %interface% is too high.</logmsg>
    <severity xmlns="">Minor</severity>
    <alarm-data reduction-key="%uei%!%nodeid%!%parm[label]%" alarm-type="1" auto-clean="false" />
  </event>
  <event>
    <uei xmlns="">uei.opennms.org/YOURCOMPANY/ns-dskPercent-rearm</uei>
    <event-label xmlns="">Disk Usage Too High - Re-Armed</event-label>
    <descr xmlns="">Threshold rearmed for the disk usage for %parm[label]% on interface %interface%. It was rearmed because the disk usage is less than or equal to the rearm threshold of %parm[rearm]%.</descr>
    <logmsg dest="logndisplay">Threshold rearmed for the disk usage for %parm[label]% on interface %interface%.</logmsg>
    <severity xmlns="">Normal</severity>
    <alarm-data clear-key="uei.opennms.org/YOURCOMPANY/ns-dskPercent-high!%nodeid%!%parm[label]%" reduction-key="%uei%:%nodeid%:%parm[label]%" alarm-type="2" auto-clean="true" />
  </event>
</events>
%parm[label]% is the name of the disk you are monitoring, e.g. "/". %parm[value]% is the percentage of the disk that is full. %parm[threshold]% is the threshold (e.g. 90%). %interface% is the interface that this is happening on (i.e. the IP address of the computer). And %parm[rearm]% is the rearm threshold (e.g. 75%).
The descr is the long description message that you see if you click on the event in the web interface. The logmsg is just the short message you see if you preview events in the web interface.
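To make the placeholder syntax concrete, here is a minimal sketch of how such tokens could be expanded into a finished message. It is an illustration only and does not reproduce OpenNMS's own expansion code; the `expand` function, its arguments, and the sample values are all hypothetical.

```python
import re
from typing import Mapping

# Matches %parm[name]% as well as plain %name% tokens such as %interface%.
_TOKEN = re.compile(r"%(?:parm\[(?P<parm>[^\]]+)\]|(?P<field>\w+))%")


def expand(template: str, fields: Mapping[str, str],
           parms: Mapping[str, str]) -> str:
    """Replace event placeholders; unknown tokens are left untouched."""
    def substitute(match: "re.Match[str]") -> str:
        if match.group("parm") is not None:
            return str(parms.get(match.group("parm"), match.group(0)))
        return str(fields.get(match.group("field"), match.group(0)))

    return _TOKEN.sub(substitute, template)


# Sample values invented for illustration.
message = expand(
    "The disk usage on %parm[label]% on interface %interface% is too high. "
    "It is %parm[value]% percent full.",
    fields={"interface": "192.0.2.10"},
    parms={"label": "/", "value": "93.0"},
)
print(message)
```

In this sketch a token only matches when it has both its opening and closing percent sign, so a placeholder that is missing one of them stays in the message as literal text.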
Modifying the eventconf.xml file
Now we need to put an entry into the event configuration file that says "look for our custom event file".
On the second to last line of the file (above </events>), you should put:
<event-file>events/YOURCOMPANY.events.xml</event-file>
This will point to our custom file. You will have to replace YOURCOMPANY with your company's name.
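Because the UEI strings have to match exactly between thresholds.xml and the custom event file, a quick consistency check can catch a leftover or misspelled company name. The following is a rough sketch using only the Python standard library; the function names and the inline sample XML are assumptions for illustration, and in practice you would read the real files instead.

```python
import xml.etree.ElementTree as ET
from typing import List, Set


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def defined_ueis(events_xml: str) -> Set[str]:
    """Collect the text of every <uei> element in an event file."""
    root = ET.fromstring(events_xml)
    return {el.text.strip() for el in root.iter()
            if local_name(el.tag) == "uei" and el.text}


def missing_ueis(threshold_xml: str, events_xml: str) -> List[str]:
    """Return UEIs named by the threshold that no event defines."""
    threshold = ET.fromstring(threshold_xml)
    wanted = [threshold.get("triggeredUEI"), threshold.get("rearmedUEI")]
    known = defined_ueis(events_xml)
    return [uei for uei in wanted if uei and uei not in known]


# Trimmed-down samples for illustration only.
THRESHOLD = ('<threshold type="high" ds-name="ns-dskPercent" '
             'triggeredUEI="uei.opennms.org/YOURCOMPANY/ns-dskPercent-high" '
             'rearmedUEI="uei.opennms.org/YOURCOMPANY/ns-dskPercent-rearm" />')
EVENTS = """<events xmlns="http://xmlns.opennms.org/xsd/eventconf">
  <event><uei xmlns="">uei.opennms.org/YOURCOMPANY/ns-dskPercent-high</uei></event>
  <event><uei xmlns="">uei.opennms.org/YOURCOMPANY/ns-dskPercent-rearm</uei></event>
</events>"""

print(missing_ueis(THRESHOLD, EVENTS))
```

An empty list means every UEI referenced by the threshold has a matching event definition.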
Modifying your notifications.xml file
We have now set up events, but not email notifications for them. On the second to last line (above </notifications>), you should put:
<notification name="Disk Usage Too High" status="on" writeable="yes">
  <uei xmlns="">uei.opennms.org/YOURCOMPANY/ns-dskPercent-high</uei>
  <description xmlns="">The disk usage for a monitored device is too high.</description>
  <rule xmlns="">IPADDR != '0.0.0.0'</rule>
  <destinationPath xmlns="">Email-Admin</destinationPath>
  <text-message xmlns="">The disk usage on %parm[label]% on interface %interface% is too high. It is %parm[value]% percent full, which exceeds the high threshold of %parm[threshold]% percent.</text-message>
  <subject xmlns="">Notice #%noticeid%: The disk usage on %parm[label]% on interface %interface% is too high.</subject>
</notification>
You will have to replace YOURCOMPANY with your company's name.
Setting status to "on" means that emails will be sent for this notification. (If you want to turn off notifications for this event temporarily, you can set it to "off".) The destinationPath names the destination path, i.e. the saved set of email addresses, that the emails are sent to. (See the section on adding or removing notification email addresses above.)
As mentioned in the notifd-configuration.xml description above, you can set up automatic acknowledgments, but you may not want to. Automatic acknowledgments require less code, but if you would like every notification to be acknowledged manually, then you will want to add a notification entry for rearm events:
<notification name="Disk Usage Too High Rearmed" status="on" writeable="yes">
  <uei xmlns="">uei.opennms.org/YOURCOMPANY/ns-dskPercent-rearm</uei>
  <description xmlns="">A monitored device has recovered from high disk usage.</description>
  <rule xmlns="">IPADDR != '0.0.0.0'</rule>
  <destinationPath xmlns="">Email-Admin</destinationPath>
  <text-message xmlns="">Threshold rearmed for the disk usage for %parm[label]% on interface %interface%. It was rearmed because the disk usage is less than or equal to the rearm threshold of %parm[rearm]%.</text-message>
  <subject xmlns="">Notice #%noticeid%: High Threshold Rearmed for %parm[ds]% on node %nodelabel%.</subject>
</notification>
Modifying your notifd-configuration.xml file
If you would like automatic acknowledgment instead, add the following on the line before the <queue> entry:
<auto-acknowledge resolution-prefix="RESOLVED: " uei="uei.opennms.org/YOURCOMPANY/ns-dskPercent-rearm" acknowledge="uei.opennms.org/YOURCOMPANY/ns-dskPercent-high">
  <match xmlns="">interfaceid</match>
</auto-acknowledge>
You will have to replace YOURCOMPANY with your company's name. This tells OpenNMS to send an email with "RESOLVED: " at the start of the subject line when the interface of the second event (ns-dskPercent-rearm) matches the interface that the first event (ns-dskPercent-high) was seen on.
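The matching rule can be pictured as follows. This is a simplified sketch of the idea rather than the actual notifd logic: the `Notice` class and the `auto_acknowledge` function are hypothetical, and matching on interfaceid alone reflects the single match element in the configuration above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Notice:
    """An outstanding notification, reduced to the fields used for matching."""
    uei: str
    interfaceid: str
    acknowledged: bool = False


def auto_acknowledge(outstanding: List[Notice], resolving_uei: str,
                     acknowledge_uei: str, event_uei: str,
                     event_interfaceid: str) -> List[Notice]:
    """Acknowledge notices resolved by an incoming event and return them."""
    resolved: List[Notice] = []
    if event_uei != resolving_uei:
        return resolved  # this event does not resolve anything
    for notice in outstanding:
        if (not notice.acknowledged
                and notice.uei == acknowledge_uei
                and notice.interfaceid == event_interfaceid):
            notice.acknowledged = True
            resolved.append(notice)
    return resolved


HIGH = "uei.opennms.org/YOURCOMPANY/ns-dskPercent-high"
REARM = "uei.opennms.org/YOURCOMPANY/ns-dskPercent-rearm"

# Two open notices on different (made-up) interfaces.
notices = [Notice(HIGH, "192.0.2.10"), Notice(HIGH, "192.0.2.11")]
done = auto_acknowledge(notices, REARM, HIGH, REARM, "192.0.2.10")
print([n.interfaceid for n in done])
```

Only the notice on the interface that reported the rearm is acknowledged; the notice for the other interface stays open until its own rearm event arrives.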