I'm using Cassandra to store pictures. We are currently mass-migrating pictures from an old system. Everything works great for a while, but eventually we get a TimedOutException when saving, which I assume is because the work queue is full.
However, after waiting several hours for it to finish, the situation stays the same: the node doesn't recover on its own even after we stop the migration.
There seems to be a problem with only one node; its tpstats command shows the following data.
The pending MutationStage operations keep increasing even though we stopped the inserts hours ago.
What exactly does that mean? What is the MutationStage?
What can I check to see why it isn't stabilising after so long? All the other servers in the ring are at 0 pending operations.
Any new insert we attempt throws the TimedOutException exception.
This is the ring information, in case it's useful (the node with issues is the first one):
EDIT: The last lines in the log are as follows
INFO [OptionalTasks:1] 2013-02-05 10:12:59,140 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='pics_persistent', ColumnFamily='master') (estimated 92972117 bytes)
INFO [OptionalTasks:1] 2013-02-05 10:12:59,141 ColumnFamilyStore.java (line 643) Enqueuing flush of Memtable-master@916497516(74377694/92972117 serialized/live bytes, 141 ops)
INFO [OptionalTasks:1] 2013-02-05 10:14:49,205 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='pics_persistent', ColumnFamily='master') (estimated 80689206 bytes)
INFO [OptionalTasks:1] 2013-02-05 10:14:49,207 ColumnFamilyStore.java (line 643) Enqueuing flush of Memtable-master@800272493(64551365/80689206 serialized/live bytes, 113 ops)
WARN [MemoryMeter:1] 2013-02-05 10:16:10,662 Memtable.java (line 197) setting live ratio to minimum of 1.0 instead of 0.0015255633589225548
INFO [MemoryMeter:1] 2013-02-05 10:16:10,663 Memtable.java (line 213) CFS(Keyspace='pics_persistent', ColumnFamily='master') liveRatio is 1.0 (just-counted was 1.0). calculation took 38ms for 86 columns
INFO [OptionalTasks:1] 2013-02-05 10:16:33,267 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='pics_persistent', ColumnFamily='master') (estimated 71029403 bytes)
INFO [OptionalTasks:1] 2013-02-05 10:16:33,269 ColumnFamilyStore.java (line 643) Enqueuing flush of Memtable-master@143498560(56823523/71029403 serialized/live bytes, 108 ops)
INFO [ScheduledTasks:1] 2013-02-05 11:36:27,798 GCInspector.java (line 122) GC for ParNew: 243 ms for 1 collections, 1917768456 used; max is 3107979264
INFO [ScheduledTasks:1] 2013-02-05 13:00:54,090 GCInspector.java (line 122) GC for ParNew: 327 ms for 1 collections, 1966976760 used; max is 3107979264
Check the status of the Cassandra nodes in your cluster: go to the /<Install_Dir>/apache-cassandra/bin/ directory and run ./nodetool status. If the status for all the nodes shows as UN, the nodes are up and running. If the status for any node shows as DN, that particular node is down.
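A minimal sketch of those checks (the install path is just a placeholder, and <other_node_ip> stands for whichever node you want to inspect remotely):

# Assumes a tarball-style install; adjust the path to your layout
cd /<Install_Dir>/apache-cassandra/bin/

# UN = Up/Normal, DN = Down/Normal
./nodetool status

# Thread-pool backlog on the local node; compare the "Pending"
# column for MutationStage with the other nodes in the ring
./nodetool tpstats
./nodetool -h <other_node_ip> tpstats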
Mutation is a thrift-generated class defined in the cassandra.thrift file.
I guess you're simply overloading one of your nodes with writes, i.e. you are writing faster than it can digest them. That is easy to do if your writes are huge.
The MutationStage keeps increasing even after you stopped writing to the cluster because the other nodes are still processing queued mutation requests and sending replicas to this overloaded node.
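One way to see that happening is to look at the hint backlog the healthy nodes are keeping for the overloaded one (a sketch, assuming a 1.2-era setup where hints are stored in the system keyspace):

# Run these on the OTHER nodes in the ring: a large or growing hints backlog
# means they are still queueing replica writes destined for the overloaded node
./nodetool tpstats   # look at the HintedHandoff pool
./nodetool cfstats   # look at the system keyspace's hints column family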
I can't say exactly why this one node got overloaded; there may be several reasons.
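A few things worth checking on the struggling node, sketched as plain nodetool/shell commands (the log path in the grep is an assumption; use wherever your system.log actually lives):

# Is the node stuck flushing or compacting?
./nodetool compactionstats

# Per-column-family stats (memtable size, pending tasks, SSTable counts);
# look at the pics_persistent / master section
./nodetool cfstats

# Is the data (and therefore the write load) spread evenly across the ring?
./nodetool ring

# Long or frequent GC pauses show up as GCInspector lines, like the ones in your log
grep GCInspector /var/log/cassandra/system.log | tail -20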