How does cassandra split keyspace data when multiple directories are configured?

Tags:

cassandra

I have configured two separate data directories in my cassandra.yaml file, as shown below:

data_file_directories:
    - E:/Cassandra/data/var/lib/cassandra/data
    - K:/Cassandra/data/var/lib/cassandra/data

When I create a keyspace and insert data, the keyspace is created in both directories and the data is scattered between them. How does Cassandra split the data between multiple directories, and what rule does it follow?

vignesh kumar rathakumar asked Apr 10 '13 12:04




1 Answer

You are using Cassandra's JBOD feature when you add multiple entries under data_file_directories. Cassandra spreads data across the configured drives in proportion to their available space, so a drive with more free space receives more of the data.

This also lets you take advantage of the disk_failure_policy setting. You can read about the details here: http://www.datastax.com/dev/blog/handling-disk-failures-in-cassandra-1-2

In short, you can configure Cassandra to keep running and do what it can when a disk fills up or fails completely. This has an advantage over RAID0 (which gives you effectively the same capacity as JBOD): after a disk failure you do not have to restore the whole data set from backup or a full repair, only run a repair for the missing data. On the other hand, RAID0 provides higher throughput, depending on how well you tune the array to match the filesystem and drive geometry.
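
As a rough sketch of how the two settings sit together in cassandra.yaml, the fragment below reuses the directories from the question. The choice of best_effort is only an assumption for illustration; the blog post linked above describes what each value does, and you should check the values against the documentation for your Cassandra version.

```yaml
# cassandra.yaml (fragment); paths copied from the question
data_file_directories:
    - E:/Cassandra/data/var/lib/cassandra/data
    - K:/Cassandra/data/var/lib/cassandra/data

# Values documented for Cassandra 1.2: stop, best_effort, ignore.
# best_effort is picked here purely as an example.
disk_failure_policy: best_effort
```

With a policy that keeps the node running, the data that lived on the lost disk still has to be recovered by running a repair, as described above.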

If you have the resources for a fault-tolerant, more performant RAID setup (RAID10, for example), you may want to use a single directory for simplicity. Most deployments, though, are starting to lean towards the density route, using JBOD rather than system-level fault tolerance.

You can read about the thought process behind the development of this feature here: https://issues.apache.org/jira/browse/CASSANDRA-4292

zznate answered Sep 21 '22 20:09