Cacti host runs out of capacity

Our Cacti host in production ran out of capacity recently. We use Cacti to create graphs for MySQL (including InnoDB and Memory engine databases) and MongoDB. The key benefit of Cacti is that it is easy for users to understand, so our developers can check DB performance metrics by themselves. It is also easy for DBAs to set up, because it does not require installing and maintaining an agent on each DB host.

The easy setup also causes a problem: all polling has to be done on the Cacti server. We ran into a performance problem about a year ago, when the poller could not finish all poll items within 1 minute. I replaced the PHP poller with spine, which is written in native C and is much faster, and it worked fine for a while. As more and more hosts were added, I increased "Maximum Concurrent Poller Processes" and "Maximum Threads per Process" step by step, and Cacti kept managing to finish each polling cycle within 1 minute.
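
For reference, these settings can also be checked from the command line instead of the web UI. The snippet below is only a sketch: it assumes the Cacti database is named "cacti" and that the values are stored under their usual names in the settings table.

# Show the poller type and concurrency settings straight from the Cacti database.
# The database name, user, and setting names (poller_type, path_spine,
# concurrent_processes, max_threads) are assumptions; adjust them to your install.
mysql -u cacti -p cacti -e "SELECT name, value FROM settings
  WHERE name IN ('poller_type', 'path_spine', 'concurrent_processes', 'max_threads');"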

At the same time, the host load kept increasing, and recently reached 45 on this physical host with 24 virtual CPUs (2 CPUs). Some hosts started to time out. I tried adjusting "Maximum Concurrent Poller Processes" and "Maximum Threads per Process" again, but it did not help. A load of 45 is already far beyond the 24 CPUs: the host is simply overloaded. We could move Cacti to a more powerful server to scale up, but that would not solve the scale-out problem; we would run into the same wall sooner or later.
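
A quick way to confirm the overload on the host itself is just to compare the load average with the CPU count; nothing Cacti-specific here.

# Compare the load averages with the number of virtual CPUs.
# A sustained load well above the CPU count (24 on this host) means it is saturated.
nproc      # number of virtual CPUs
uptime     # 1/5/15-minute load averages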

At this point, Cacti handles ~800 hosts with 18k data sources and 18k RRDs every minute. "Maximum Concurrent Poller Processes" is 3 and "Maximum Threads per Process" is 60, and each polling round finishes in 57 seconds on average. The server CPU model is "Intel(R) Xeon(R) CPU X5670 @ 2.93GHz", 2 CPUs with 24 vCPUs.
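
A back-of-the-envelope check with these numbers shows how little headroom is left per thread; this is simple arithmetic, not a measured per-item timing.

# 18k data sources spread over 3 processes x 60 threads = 100 per thread,
# which leaves roughly 0.6 seconds per data source in a 60-second cycle.
datasources=18000
processes=3
threads=60
per_thread=$(( datasources / (processes * threads) ))   # 100 data sources per thread
echo "$per_thread data sources per thread, ~$(( 60 * 1000 / per_thread )) ms each"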

Although Cacti has "Distributed and remote polling" on its roadmap, which would solve the problem of putting all the load on a single host, the release date is unknown. We decided to stop adding more hosts to Cacti and to pursue another solution.

Debug and fix missing graph trees and hosts in Cacti

Recently we ran into a problem where some trees and hosts suddenly disappeared from the graph page, even though they still existed in the Management "Graph Trees" list.

I googled around and did not find a useful clue, so I debugged the problem myself. I opened the graph page to check whether there was any error, guessing that it might be caused by a JavaScript error that the web browser silently ignored. So I opened Chrome's console and found the following error:

[Screenshot: cacti_error, the JavaScript error shown in Chrome's console]

It showed that something was wrong with the host "crp-wikidbstg02_3307". So the problem was caused or triggered by this host: either it hit a Cacti bug, or its configuration was flawed. One option was to go through this host's detailed configuration data in Cacti and find the flaw. I chose the easier option: just delete the host and let the auto-add job re-add it later. As expected, the missing trees and hosts came back on the graph page, and they stayed even after the host was re-added.
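
If you would rather inspect the host before deleting it, you can look it up in the Cacti database first. This is only a sketch: the table and column names assume the standard Cacti 0.8.x schema (host, graph_tree_items) and a database named "cacti".

# Find the suspect host record and the graph tree items that reference it.
# Table and column names are assumptions based on the stock 0.8.x schema.
mysql -u cacti -p cacti -e "
  SELECT id, description, hostname, status FROM host
   WHERE description LIKE '%crp-wikidbstg02_3307%';
  SELECT graph_tree_id, host_id FROM graph_tree_items
   WHERE host_id IN (SELECT id FROM host WHERE description LIKE '%crp-wikidbstg02_3307%');"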

If you run into the same symptom, you can try this approach to see whether it is the same kind of problem.

A script to debug Cacti

My check_cacti.sh script reports warnings like these frequently:

01/09/2013 05:50:01 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
01/09/2013 05:53:01 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
01/09/2013 05:56:01 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
01/09/2013 05:59:01 PM - POLLER: Poller[0] WARNING: There are '1' detected as overrunning a polling process, please investigate
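
If you just want to see how often these warnings happen, they can be pulled straight out of the Cacti log; the path below assumes the default <cacti_dir>/log/cacti.log location, so adjust it to your install.

# Show the most recent poller overrun warnings from the Cacti log.
# The log path is an assumption; change it to match your setup.
grep "overrunning a polling process" /var/www/html/cacti/log/cacti.log | tail -n 20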

It basically means that some poller items cannot finish in time, but it does not report which ones.

So I wrote the script below to find out.

#!/bin/bash

if [ $# -lt 1 ]; then
cat <<EOF
  Check long-running cacti poller commands
  usage:  debug_cacti.sh <threshold_seconds> [thresh_minutes]
  examples:
          debug_cacti.sh 30   #check commands running longer than 30 seconds for 3 minutes
          debug_cacti.sh 30 5 #check commands running longer than 30 seconds for 5 minutes
EOF
  exit 1
fi

# arguments
thresh_sec="$1"
thresh_min="$2"
if [ -z "$thresh_min" ] ; then
   thresh_min=3
fi

echo "Check poller command longer than $thresh_sec seconds at next $thresh_min minutes..."
echo "Start time : `date +%M:%S`"
start_min=`date +%M`
min=`date +%M`
sec=`date +%S`
elapsed_min=0
while [ $elapsed_min -lt $thresh_min ]
do
  # sleep until the threshold second of the current minute has passed
  sleep_sec=`expr $thresh_sec - $sec + 1`
  if [ $sleep_sec -gt 0 ] ; then
    echo "Sleep $sleep_sec seconds "
    sleep $sleep_sec
  fi

  # past the threshold second, any php poller command still running was
  # started at the top of the minute, so it has exceeded the threshold
  sec=`date +%S`
  while [ $sec -gt $thresh_sec ]
  do
    echo "Time = $min:$sec"
    ps -ef | grep php | grep -v grep
    sleep 3
    sec=`date +%S`
  done

  # calculate elapsed time, wrapping around at 60 minutes
  min=`date +%M`
  elapsed_min=`expr $min - $start_min`
  if [ $elapsed_min -lt 0 ] ; then
     elapsed_min=`expr $elapsed_min + 60`
  fi
done

One of my outputs is shown below. It shows that the poller script for the Cassandra host is slow (it is the last one to finish).

./debug_cacti.sh 30
Check poller command longer than 30 at next 3 minutes...
Start time : 29:
Time = 29:31
root     27597     1  0 23:28 ?        00:00:00 /usr/bin/php -q /export/home/cacti-0.8.7g/cmd.php 0 67
root     29338 29337  0 23:29 ?        00:00:00 /bin/sh -c php /var/www/html/cacti/poller.php > /dev/null  2>&1 
root     29339 29338  1 23:29 ?        00:00:00 php /var/www/html/cacti/poller.php
root     30795 27597  0 23:29 ?        00:00:00 php /export/home/cacti-0.8.7g/scripts/ss_get_cassandra_stats.php --host sharedcass.alexzeng.wordpress.com --port --user --pass --items dp
Time = 29:34
root     27597     1  0 23:28 ?        00:00:00 /usr/bin/php -q /export/home/cacti-0.8.7g/cmd.php 0 67
root     30981 27597  0 23:29 ?        00:00:00 php /export/home/cacti-0.8.7g/scripts/ss_get_cassandra_stats.php --host sharedcass.alexzeng.wordpress.com --port --user --pass --items dc,dd
Time = 29:37
Time = 29:40
Time = 29:43
Time = 29:46
Time = 29:49
Time = 29:52
Time = 29:55
Time = 29:58
...

Now I have found the problem. That is halfway to solving it.