Debug Cassandrar JVM thread 100% CPU usage issue

Recently, one of our cassandra nodes run into an issue: one CPU thread utilization is 100% while the others are almost idle. The node shows “Down” in the nodetool now and then.

Java supports threads, and threads can use different CPU thread without problem in theory. But why one CPU is 100% while the others are idle? What the 100% utilization CPU is doing? Here is what I did:

1. Find out the thread ID that uses 100% CPU
In this Linux host, I run top to see the CPU usage, and then type 1 to show each CPU thread uage, then type H to show thread of processes. When I see their is one CPU thread is 100% usage, I can see the top thread as below.

 35376 cassandr  20   0 28.8g  10g 1.2g R 99.7  8.0  11:13.08 java

Its CPU usage is 99.7%

2. Find out what the thread is doing
I use “jstack -l ” to dump all JVM thread calling stacks. I need to find the thread with ID 35376. The number of thread id in jstack dump file is hexadecimal format while it’s decimal format in top output. Decimal 35376 equals 8a30 HEX. Got it in the output:

...
"VM Thread" prio=10 tid=0x00007f2a78313000 nid=0x8a30 runnable
...

So I know that “VM Thread” is the culprit. From my basic understanding, its main job is GC(Garbage Collection). If there is no memory leak problem in the application java code, GC can be improved by adjust JVM HEAP SIZE parameters. First I need to check the current heap usage.

3. Get JVM heap usage
We can just “jmap -heap “:

$ ./jmap -heap 122576
Attaching to process ID 122576, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.10-b01

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 8589934592 (8192.0MB)
   NewSize          = 2147483648 (2048.0MB)
   MaxNewSize       = 2147483648 (2048.0MB)
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 85983232 (82.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 1932787712 (1843.25MB)
   used     = 1932773920 (1843.2368469238281MB)
   free     = 13792 (0.013153076171875MB)
   99.99928641930438% used
Eden Space:
   capacity = 1718091776 (1638.5MB)
   used     = 1718091768 (1638.4999923706055MB)
   free     = 8 (7.62939453125E-6MB)
   99.99999953436713% used
From Space:
   capacity = 214695936 (204.75MB)
   used     = 214682152 (204.73685455322266MB)
   free     = 13784 (0.01314544677734375MB)
   99.99357975737371% used
To Space:
   capacity = 214695936 (204.75MB)
   used     = 0 (0.0MB)
   free     = 214695936 (204.75MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 6442450944 (6144.0MB)
   used     = 6442450896 (6143.999954223633MB)
   free     = 48 (4.57763671875E-5MB)
   99.99999925494194% used
Perm Generation:
   capacity = 58068992 (55.37890625MB)
   used     = 34732264 (33.123268127441406MB)
   free     = 23336728 (22.255638122558594MB)
   59.81206630898639% used

New Generation is 99.999% usage. This could be a problem because when the code create a new object it will need get memory from this part. If it’s full, JVM needs to scan the memory area to release memory. Before it’s done, the program can do nothing but wait.

4. Adjust JVM heap size
I set the parameters in cassandra-env.sh :

MAX_HEAP_SIZE="16G"
HEAP_NEWSIZE="4G"

It will set it in JVM options as follows:

JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE}"
JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}"
JVM_OPTS="$JVM_OPTS -Xmn${HEAP_NEWSIZE}"

5. Check result
After restart cassandra, and let it run a while, I run jmap again:

$ ./jmap -heap 35339
Attaching to process ID 35339, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.10-b01

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 17179869184 (16384.0MB)
   NewSize          = 4294967296 (4096.0MB)
   MaxNewSize       = 4294967296 (4096.0MB)
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 85983232 (82.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 3865509888 (3686.4375MB)
   used     = 1271912480 (1212.9902648925781MB)
   free     = 2593597408 (2473.447235107422MB)
   32.90413210294704% used
Eden Space:
   capacity = 3436052480 (3276.875MB)
   used     = 1013629048 (966.671989440918MB)
   free     = 2422423432 (2310.203010559082MB)
   29.499812761881913% used
From Space:
   capacity = 429457408 (409.5625MB)
   used     = 258283432 (246.31827545166016MB)
   free     = 171173976 (163.24422454833984MB)
   60.14180386428449% used
To Space:
   capacity = 429457408 (409.5625MB)
   used     = 0 (0.0MB)
   free     = 429457408 (409.5625MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 12884901888 (12288.0MB)
   used     = 3744285600 (3570.8290100097656MB)
   free     = 9140616288 (8717.170989990234MB)
   29.059480875730515% used
Perm Generation:
   capacity = 62611456 (59.7109375MB)
   used     = 37563512 (35.82335662841797MB)
   free     = 25047944 (23.88758087158203MB)
   59.99463101449038% used

In another case, it’s tomcat and home-grown java application, we got similar 100% CPU usage on one thread. While the solution is opposite : I reduced the JVM heap size to fix it because the default MAX JVM heap size is 32GB which is much more than enough. The program performance is good at start, but it will run slower and slower. Because setting too large heap size may also causes similar problem because the bigger the JVM heap size, the more work GC need to do, similar to Oracle shared pool size.

jmap and jstack is available in JDK. If you don’t have JDK, you need to download the exact same version(including minor version) JDK of the corresponding JRE otherwise it will complain incompatible version.

Kindly reminder, I am a newbie to java, thus my understanding may be wrong. Please use your own judgement on the contents.

References:
http://java.sys-con.com/node/1611555
http://middlewaremagic.com/weblogic/?tag=young-generation

Advertisements

About Alex Zeng
I would be very happy if this blog can help you. I appreciate every honest comments. Please forgive me if I'm too busy to reply your comments in time.

One Response to Debug Cassandrar JVM thread 100% CPU usage issue

  1. jane says:

    A nice tutorial on
    JVM Architecture Specification Basic The Heap Area Introduction, teach about the JVM Heap Area in details

    JVM Architecture Specification Basic The Method Area explained, teach about the JVM method area

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: