Monitoring Thoughts

By tadgeobrien, 8 January, 2015

For the last few years we have been working on developing the monitoring of our network. When I was first hired, I was asked to look into what we might use. We were using What's Up Gold, but it had aged and wasn't being kept up. We were also using MRTG, but not on all of our links. Of course, all the usual suspects showed up in my Google searches:

  • Nagios
  • Zenoss
  • Cacti
  • Zabbix

Then there were two that I hadn't seen before.

  • OpenNMS
  • Observium

When we got down to it, we ended up going with OpenNMS. It gave us two things we were looking for out of the box: graphing for devices and up/down status. Because we could download and install the software ahead of time, we were able to do some testing before committing. During this time we were also able to purchase some training and support for the software, which definitely helped as well.

Since then I think we have been satisfied with the outcomes. One of the big things I have been finding interesting lately is monitoring based on features of our network equipment that we either hadn't configured or just weren't using. Those were, of course, SNMP and syslog.

The SNMP MIBs that were already set up within OpenNMS made it easy to get started on some basic monitoring. What we quickly realized was that we didn't really have a system that kept track of all of the devices we were using. We have PDUs, UPSes, transfer switches, network switches, wireless controllers, firewalls, web filters, and a few other devices that we need to know about on the network.

As we started to use OpenNMS we started to unravel a few interesting things, which of course have led to a few more.

At this point we have finally made it to a place where we are ready to look at syslog and SNMP traps. We have a couple of Linux devices set up as syslog servers, with the hope of eventually setting them up with anycast addressing. Now the next step is actually doing something with what they collect.

Here are a few ideas of what I want to look at. Some would require investigation; others would simply be actionable:

From Syslog

  • DHCP "No Free Leases"
  • Dying gasps from switches
  • Battery alerts and failed self-tests from UPSes
  • PDU outlets turning on and off
  • Power supplies on switches going up and down
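
As a starting point, the list above could be turned into a small set of pattern rules that tag incoming syslog lines with an event type. This is only a rough sketch in Python, not how OpenNMS does it: the rule names and regular expressions are my own assumptions, and the real message text will vary by vendor and firmware, so each pattern would need to be checked against actual logs.

```python
import re

# Hypothetical rules: (event name, pattern). The message wording is assumed
# and must be verified against what each device really sends.
RULES = [
    ("dhcp_no_free_leases", re.compile(r"no free leases", re.IGNORECASE)),
    ("switch_dying_gasp", re.compile(r"dying[ -]gasp", re.IGNORECASE)),
    ("ups_battery_or_self_test",
     re.compile(r"\bUPS\b.*(battery|self[ -]?test (failed|failure))", re.IGNORECASE)),
    ("pdu_outlet_change", re.compile(r"outlet \S+ (turned )?(on|off)", re.IGNORECASE)),
    ("switch_power_supply", re.compile(r"power supply", re.IGNORECASE)),
]


def classify(line):
    """Return the name of the first rule matching the syslog line, or None."""
    for name, pattern in RULES:
        if pattern.search(line):
            return name
    return None


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    for sample in (
        "dhcpd: DHCPDISCOVER from 00:11:22:33:44:55 via eth0: no free leases",
        "pdu1: Outlet 4 turned off",
        "sshd: session opened",
    ):
        print(classify(sample), "<-", sample)
```

Rules are checked in order and the first match wins, so the more specific patterns should come before broad ones like "power supply".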

From SNMP (Have to look back and see in my notes what else I was thinking about).

  • Power supplies going bad

After we actually get the events, the next step is sending them along to our ticketing system with information on what our Service Desk and Technical Operations Center can actually do with them. Hopefully by then we will also have some knowledge-base-type documents that we can create to make this even more functional.