A script to debug cacti
January 10, 2013
My check_cacti.sh script frequently reports warnings like these:
01/09/2013 05:50:01 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
01/09/2013 05:53:01 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
01/09/2013 05:56:01 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
01/09/2013 05:59:01 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
This basically means that some poller processes did not finish in time, but the warning doesn't say which poller items are responsible.
So I wrote the script below to find out.
#!/bin/bash
# debug_cacti.sh: show the php processes still running late in each minute

if [ $# -lt 1 ]; then
    cat <<EOF
Check long run cacti poller commands
usage:
    debug_cacti.sh <threshold_second> [thresh_minutes]
examples:
    debug_cacti.sh 30    #check command run longer than 30 seconds in 3 minutes
    debug_cacti.sh 30 5  #check command run longer than 30 seconds in 5 minutes
EOF
    exit 1
fi

# arguments
thresh_sec="$1"
thresh_min="${2:-3}"    # watch for 3 minutes unless told otherwise

# configuration
echo "Check poller command longer than $thresh_sec seconds at next $thresh_min minutes..."
echo "Start time : $(date +%M:%S)"
start_min=$(date +%M)
min=$(date +%M)
sec=$(date +%S)         # must be set before the first expr below
elapsed_min=0

while [ "$elapsed_min" -lt "$thresh_min" ]
do
    # sleep until just past the threshold second of the current minute
    sleep_sec=$(expr "$thresh_sec" - "$sec" + 1)
    if [ "$sleep_sec" -gt 0 ] ; then
        echo "Sleep $sleep_sec seconds "
        sleep "$sleep_sec"
    fi

    # list php processes every 3 seconds until the minute rolls over
    sec=$(date +%S)
    while [ "$sec" -gt "$thresh_sec" ]
    do
        echo "Time : $min:$sec"
        ps -ef | grep php | grep -v grep
        sleep 3
        sec=$(date +%S)
    done

    # calculate elapsed time
    min=$(date +%M)
    elapsed_min=$(expr "$min" - "$start_min")
    if [ "$elapsed_min" -lt 0 ] ; then
        # the minute counter wrapped around at 60
        elapsed_min=$(expr "$elapsed_min" + 60)
    fi
done
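The wrap-around handling at the end of the loop is the part that is easiest to get wrong, so here it is on its own as a minimal sketch. The function name `elapsed_minutes` is hypothetical and not part of the script above, and the calculation assumes less than one hour passes between the two readings.

```shell
#!/bin/sh
# Hypothetical helper: minutes elapsed between two `date +%M` readings.
# Assumes the two readings are less than one hour apart.
elapsed_minutes() {
    # ${1#0} strips one leading zero, so "08" is not read as an octal number
    echo $(( (${2#0} - ${1#0} + 60) % 60 ))
}

elapsed_minutes 58 02   # started at minute 58, now minute 02: prints 4
elapsed_minutes 08 09   # prints 1
```

Adding 60 before taking the remainder keeps the result non-negative, which does the same job as the `if [ $elapsed_min -lt 0 ]` branch in the script.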
Here is the output from one run of debug_cacti.sh. It shows that the poller on the cassandra host is slow (it is the last one to finish).
./debug_cacti.sh 30
Check poller command longer than 30 at next 3 minutes...
Start time : 29:
Time = 29:31
root 27597 1 0 23:28 ? 00:00:00 /usr/bin/php -q /export/home/cacti-0.8.7g/cmd.php 0 67
root 29338 29337 0 23:29 ? 00:00:00 /bin/sh -c php /var/www/html/cacti/poller.php > /dev/null 2>&1
root 29339 29338 1 23:29 ? 00:00:00 php /var/www/html/cacti/poller.php
root 30795 27597 0 23:29 ? 00:00:00 php /export/home/cacti-0.8.7g/scripts/ss_get_cassandra_stats.php --host sharedcass.alexzeng.wordpress.com --port --user --pass --items dp
Time = 29:34
root 27597 1 0 23:28 ? 00:00:00 /usr/bin/php -q /export/home/cacti-0.8.7g/cmd.php 0 67
root 30981 27597 0 23:29 ? 00:00:00 php /export/home/cacti-0.8.7g/scripts/ss_get_cassandra_stats.php --host sharedcass.alexzeng.wordpress.com --port --user --pass --items dc,dd
Time = 29:37
Time = 29:40
Time = 29:43
Time = 29:46
Time = 29:49
Time = 29:52
Time = 29:55
Time = 29:58
...
Now I have located the problem, which is half the battle.
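Since the goal is to spot PHP processes that outlive the threshold, another option is to filter on each process's elapsed time directly rather than sampling `ps -ef` late in the minute. The following is only a sketch: `long_running_php` is a hypothetical name, and the usage line assumes a `ps` that supports the `etimes` (elapsed seconds) output column, as procps-ng on Linux does.

```shell
#!/bin/sh
# Hypothetical filter: reads "PID ELAPSED_SECONDS COMMAND..." lines on stdin
# and prints those that mention php and have run longer than $1 seconds.
long_running_php() {
    awk -v limit="$1" '$2 > limit && /php/'
}

# Usage, assuming ps supports the etimes column:
#   ps -eo pid=,etimes=,args= | long_running_php 30
```

Because the filter keeps only processes older than the threshold, the short-lived `awk` process itself never shows up in the result, so no `grep -v grep` step is needed.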
Thank you very much!! A very useful script, which helped me find the process that did not respond. Now the Cacti poller doesn't time out and all params are being graphed correctly!