
Cassandra cleanup on several servers at once

We have a large Cassandra cluster of 18 servers, each holding close to 5 TB of data.

We added new nodes following this documentation: http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html

After adding the new servers, we started cleaning up the data with nodetool cleanup.

The documentation advises: "After all new nodes are running, run nodetool cleanup on each of the previously existing nodes to remove the keys no longer belonging to those nodes. Wait for cleanup to complete on one node before doing the next."

But in our case cleanup takes about 2 to 3 days per server. My question is: can I run cleanup on several servers at once, say 2 or 3?

Or could that lead to data loss?

Some more info:

We use Cassandra 2.0.13 with vnodes. We also store files as blobs in Cassandra.

Replication factor = 3

Anatoliy Laktionov asked May 30 '15 10:05



1 Answer

Cleanup doesn't involve any other nodes, so it is safe to run on several nodes in parallel. However, you may still want to run it on one node at a time to limit the performance impact, since cleanup can use a lot of disk I/O.
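As a rough illustration of that trade-off, here is a minimal Python sketch that runs cleanup on a list of nodes with a cap on how many run at once. It assumes nodetool is on the PATH of the machine running the script and that each node's JMX port is reachable remotely via `nodetool -h <host>`; the helper names and host names are hypothetical, not something from the question.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def cleanup_command(host):
    # Assumption: JMX is reachable remotely, so `nodetool -h <host> cleanup`
    # can be issued from one admin machine.
    return ["nodetool", "-h", host, "cleanup"]


def run_cleanup(host, runner=subprocess.run):
    """Run cleanup on one node and return (host, exit code)."""
    result = runner(cleanup_command(host))
    return host, result.returncode


def cleanup_cluster(hosts, max_parallel=2, runner=subprocess.run):
    """Run cleanup on every host, at most `max_parallel` nodes at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map keeps results in the same order as `hosts`.
        return list(pool.map(lambda h: run_cleanup(h, runner), hosts))


# Example usage (hypothetical host names for the previously existing nodes):
# for host, code in cleanup_cluster(["cass-01", "cass-02", "cass-03"], max_parallel=2):
#     print(host, "ok" if code == 0 else "failed")
```

Setting `max_parallel=1` gives the one-node-at-a-time behaviour the documentation recommends; raising it shortens the total time at the cost of more simultaneous disk I/O across the cluster.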

Richard answered Sep 28 '22 09:09