I'm using Cassandra to store pictures. We are currently mass-migrating pictures from an old system. Everything works great for a while, but eventually we get a TimedOutException when saving, which I assume is because the work queue is full.
However, after waiting several hours for it to finish, the situation stays the same: the node doesn't recover on its own even after we stop the migration.
There seems to be a problem with only one node; its tpstats command shows the following data.
The pending MutationStage operations keep increasing even though we stopped the inserts hours ago.
What exactly does that mean? What is the MutationStage?
What can I check to see why it isn't stabilising after so long? All the other servers in the ring are at 0 pending operations.
Any new insert we attempt throws the TimedOutException exception.
This is the ring information, in case it's useful (the node with issues is the first one):
EDIT: The last lines in the log are as follows
INFO [OptionalTasks:1] 2013-02-05 10:12:59,140 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='pics_persistent', ColumnFamily='master') (estimated 92972117 bytes)
INFO [OptionalTasks:1] 2013-02-05 10:12:59,141 ColumnFamilyStore.java (line 643) Enqueuing flush of Memtable-master@916497516(74377694/92972117 serialized/live bytes, 141 ops)
INFO [OptionalTasks:1] 2013-02-05 10:14:49,205 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='pics_persistent', ColumnFamily='master') (estimated 80689206 bytes)
INFO [OptionalTasks:1] 2013-02-05 10:14:49,207 ColumnFamilyStore.java (line 643) Enqueuing flush of Memtable-master@800272493(64551365/80689206 serialized/live bytes, 113 ops)
WARN [MemoryMeter:1] 2013-02-05 10:16:10,662 Memtable.java (line 197) setting live ratio to minimum of 1.0 instead of 0.0015255633589225548
INFO [MemoryMeter:1] 2013-02-05 10:16:10,663 Memtable.java (line 213) CFS(Keyspace='pics_persistent', ColumnFamily='master') liveRatio is 1.0 (just-counted was 1.0). calculation took 38ms for 86 columns
INFO [OptionalTasks:1] 2013-02-05 10:16:33,267 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='pics_persistent', ColumnFamily='master') (estimated 71029403 bytes)
INFO [OptionalTasks:1] 2013-02-05 10:16:33,269 ColumnFamilyStore.java (line 643) Enqueuing flush of Memtable-master@143498560(56823523/71029403 serialized/live bytes, 108 ops)
INFO [ScheduledTasks:1] 2013-02-05 11:36:27,798 GCInspector.java (line 122) GC for ParNew: 243 ms for 1 collections, 1917768456 used; max is 3107979264
INFO [ScheduledTasks:1] 2013-02-05 13:00:54,090 GCInspector.java (line 122) GC for ParNew: 327 ms for 1 collections, 1966976760 used; max is 3107979264
Check the status of the Cassandra nodes in your cluster: go to the /<Install_Dir>/apache-cassandra/bin/ directory and run ./nodetool status. If the status for all the nodes shows as UN, the nodes are up and running. If the status for any node shows as DN, that particular node is down.
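A minimal sketch of those checks (the install path is just a placeholder, and <other_node_ip> stands for whichever node you want to inspect remotely):

# Assumes a tarball-style install; adjust the path to your layout
cd /<Install_Dir>/apache-cassandra/bin/

# UN = Up/Normal, DN = Down/Normal
./nodetool status

# Thread-pool backlog on the local node; compare the "Pending"
# column for MutationStage with the other nodes in the ring
./nodetool tpstats
./nodetool -h <other_node_ip> tpstats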
Mutation is a thrift-generated class defined in the cassandra.thrift file.
I guess you're simply overloading one of your nodes with writes, i.e. you are writing faster than it can digest them. That is easy to do if your writes are huge.
The MutationStage keeps increasing even after you stopped writing to the cluster because the other nodes are still processing queued mutation requests and sending replicas to this overloaded node.
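One way to see that happening is to look at the hint backlog the healthy nodes are keeping for the overloaded one (a sketch, assuming a 1.2-era setup where hints are stored in the system keyspace):

# Run these on the OTHER nodes in the ring: a large or growing hints backlog
# means they are still queueing replica writes destined for the overloaded node
./nodetool tpstats   # look at the HintedHandoff pool
./nodetool cfstats   # look at the system keyspace's hints column family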
I can't say exactly why this one node got overloaded; there may be several reasons.
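A few things worth checking on the struggling node, sketched as plain nodetool/shell commands (the log path in the grep is an assumption; use wherever your system.log actually lives):

# Is the node stuck flushing or compacting?
./nodetool compactionstats

# Per-column-family stats (memtable size, pending tasks, SSTable counts);
# look at the pics_persistent / master section
./nodetool cfstats

# Is the data (and therefore the write load) spread evenly across the ring?
./nodetool ring

# Long or frequent GC pauses show up as GCInspector lines, like the ones in your log
grep GCInspector /var/log/cassandra/system.log | tail -20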