I'm configuring Nagios to monitor my HP ProCurve switches. I found excellent command and service definitions at NagiosExchange and all is working wonderfully except for the service that monitors free memory.
The definitions that I'm using are copied and pasted directly from the above-mentioned site. They are:
command in commands.cfg:
Code:
define command{
        command_name    check_hpmemoryfree
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o .1.3.6.1.4.1.11.2.14.11.5.1.1.2.1.1.1.6.1 -t 5 -w $ARG2$ -c $ARG3$ -u bytes -l free
        }
service in switch.cfg:
Code:
# Service definition MEM-FREE
define service{
        use                     generic-service ; Name of service template to use
        host_name               Switch_MDF-1
        service_description     MEM-FREE
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        notification_interval   240
        notification_period     24x7
        notification_options    c,r
        check_command           check_hpmemoryfree!nagios!2000:30000000!1000:30000000
        }
The switches list about 150MB of total memory, with about 109MB free when I view status from the switch console itself. Nagios is correctly reporting the free 109MB, but is showing the state as critical.
I've done a good bit of googling to try to understand how the "2000:30000000" and "1000:30000000" sections work. I realize that those are ARG2 and ARG3, and that ARG2 is the warning level and ARG3 is the critical level. What I don't understand is how to adjust those numbers so that warning and critical status trigger at the levels I want on my particular switches. I've found info stating that two numbers separated by a colon are a range, and other info saying they are a less-than:higher-than definition for when to return the state defined by the command.
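For reference, the Nagios plugin development guidelines describe a threshold format of "[@]start:end", where a check alerts when the value falls outside the range (or inside it, with a leading "@"). Here is a minimal sketch of that rule in Python, assuming check_snmp applies the same semantics to -w and -c. The function names are hypothetical, and 109MB is taken as 109,000,000 bytes for illustration.

```python
import math

def parse_range(spec):
    """Parse a Nagios-style range '[@]start:end' into (start, end, inside)."""
    inside = spec.startswith("@")      # '@' inverts: alert when value is INSIDE
    if inside:
        spec = spec[1:]
    if ":" in spec:
        start_s, end_s = spec.split(":", 1)
    else:
        start_s, end_s = "0", spec     # a bare 'N' means the range 0..N
    start = -math.inf if start_s == "~" else float(start_s or 0)
    end = math.inf if end_s == "" else float(end_s)   # 'N:' means N..infinity
    return start, end, inside

def alerts(spec, value):
    """True when the value should raise an alert for this range."""
    start, end, inside = parse_range(spec)
    in_range = start <= value <= end
    return in_range if inside else not in_range

def state(value, warn, crit):
    """Critical is checked first, then warning, as plugins normally do."""
    if alerts(crit, value):
        return "CRITICAL"
    if alerts(warn, value):
        return "WARNING"
    return "OK"

free_bytes = 109_000_000  # assumed byte value for "about 109MB free"
print(state(free_bytes, "2000:30000000", "1000:30000000"))  # prints CRITICAL
```

Under that reading, "1000:30000000" means "critical unless the value is between 1,000 and 30,000,000 bytes", so 109MB free lands above the upper bound and would be reported as critical, which matches what I'm seeing.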
What I'd like is to have the following:
-More than 60MB of free memory = OK
-Between 60MB and 40MB of free memory = Warning
-Less than 40MB of free memory = Critical
I will likely adjust those values once I get a better idea of memory usage under different loads.
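As a rough illustration of how those levels might translate into ARG2 and ARG3: in the standard Nagios plugin range format, "N:" means "alert when the value drops below N". The sketch below only builds the check_command string. It assumes the OID reports bytes, that 1MB is 1,000,000 bytes, and that the installed check_snmp accepts open-ended ranges (older versions may not). The helper name is hypothetical.

```python
MB = 1_000_000  # assumption: 1 MB = 10**6 bytes; use 1024 * 1024 if the switch counts that way

def below_threshold(megabytes):
    """Render 'alert when the value drops below N bytes' as the range 'N:'."""
    return f"{megabytes * MB}:"

warn_arg = below_threshold(60)   # warning once free memory is under 60MB
crit_arg = below_threshold(40)   # critical once free memory is under 40MB

# 'nagios' is the SNMP community string (ARG1) from the service definition
check_command = f"check_hpmemoryfree!nagios!{warn_arg}!{crit_arg}"
print(check_command)  # prints check_hpmemoryfree!nagios!60000000:!40000000:
```

It would be worth running check_snmp by hand with these -w and -c values before putting them into switch.cfg.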
I'd like to understand how to adjust the numbers in the service definition so that my service monitors will work as listed above. Can someone explain this, or point me to a resource that explains what the colon-separated numbers mean for this particular command? I haven't had any luck in my searching, but I'm continuing to look for as much information as I can.




