Cassandra file structure - how are the files used?

When experimenting with Cassandra I've observed that Cassandra writes to the following files:

/.../cassandra/commitlog/CommitLog-<id>.log
/.../cassandra/data/Keyspace1/Standard1-1-Data.db
/.../cassandra/data/Keyspace1/Standard1-1-Filter.db
/.../cassandra/data/Keyspace1/Standard1-1-Index.db
/.../cassandra/data/system/LocationInfo-1-Data.db
/.../cassandra/data/system/LocationInfo-1-Filter.db
/.../cassandra/data/system/LocationInfo-1-Index.db
/.../cassandra/data/system/LocationInfo-2-Data.db
/.../cassandra/data/system/LocationInfo-2-Filter.db
/.../cassandra/data/system/LocationInfo-2-Index.db
/.../cassandra/data/system/LocationInfo-3-Data.db
/.../cassandra/data/system/LocationInfo-3-Filter.db
/.../cassandra/data/system/LocationInfo-3-Index.db
/.../cassandra/system.log

The general structure seems to be:

/.../cassandra/commitlog/CommitLog-ID.log
/.../cassandra/data/KEYSPACE/COLUMN_FAMILY-N-Data.db
/.../cassandra/data/KEYSPACE/COLUMN_FAMILY-N-Filter.db
/.../cassandra/data/KEYSPACE/COLUMN_FAMILY-N-Index.db
/.../cassandra/system.log

What is the Cassandra file structure? More specifically, how are the data and commitlog directories used, and what is the structure of the files in the data directory (Data/Filter/Index)?

asked Mar 01 '10 21:03 by knorv


2 Answers

A write to a Cassandra node first hits the CommitLog (a sequential write). Cassandra then stores the values in column-family-specific, in-memory data structures called Memtables. A Memtable is flushed to disk whenever one of the configurable thresholds is exceeded: (1) the size of the data in the Memtable, (2) the number of objects it holds, or (3) the lifetime of the Memtable.
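
The write path above can be sketched as a toy model. This is only an illustration of the ordering (commit log, then Memtable, then flush), not Cassandra's implementation: the class name `ToyNode`, its three threshold parameters and their default values are all made up for the example.

```python
import time


class ToyNode:
    """Toy model of the write path: commit log first, then Memtable, then flush."""

    def __init__(self, max_bytes=64, max_objects=3, max_age_seconds=60.0):
        # Illustrative thresholds only; these are not Cassandra defaults.
        self.max_bytes = max_bytes
        self.max_objects = max_objects
        self.max_age_seconds = max_age_seconds
        self.commit_log = []   # stands in for the sequential CommitLog file
        self.sstables = []     # each flush produces one sorted, immutable table
        self._new_memtable()

    def _new_memtable(self):
        self.memtable = {}
        self.memtable_bytes = 0
        self.memtable_created = time.monotonic()

    def write(self, key, value):
        self.commit_log.append((key, value))          # 1. append to the commit log
        self.memtable[key] = value                    # 2. update the in-memory table
        self.memtable_bytes += len(key) + len(value)  # rough size, counts overwrites too
        if self._should_flush():
            self.flush()

    def _should_flush(self):
        age = time.monotonic() - self.memtable_created
        return (self.memtable_bytes >= self.max_bytes       # (1) data size
                or len(self.memtable) >= self.max_objects   # (2) object count
                or age >= self.max_age_seconds)             # (3) lifetime

    def flush(self):
        # Persist the Memtable as key-sorted pairs and start a fresh one.
        self.sstables.append(sorted(self.memtable.items()))
        self._new_memtable()
```

With `ToyNode(max_bytes=1000, max_objects=3)`, the third distinct key written triggers a flush, leaving one sorted table in `sstables` and an empty Memtable. The age check is only evaluated on a write here; a real system would also need a background timer.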

The data folder contains a subfolder for each keyspace. Each subfolder contains three kinds of files:

  • Data file: an SSTable (nomenclature borrowed from Google; short for Sorted Strings Table), which is a file of key-value string pairs sorted by key.
  • Index file: (key, offset) pairs that point into the data file.
  • Filter file: a Bloom filter over all keys in the data file.
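
How the three files cooperate on a read can be sketched in memory. The byte layout below (two 4-byte big-endian lengths, then key and value) and the names `ToyBloomFilter`, `build_sstable` and `read_key` are assumptions chosen for the example; Cassandra's real on-disk format differs.

```python
import hashlib
import struct


class ToyBloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def build_sstable(pairs):
    """Return (data, index, bloom) for a dict mapping string keys to string values."""
    data = bytearray()
    index = {}                      # key -> byte offset of its record in data
    bloom = ToyBloomFilter()
    for key, value in sorted(pairs.items()):   # records are written in key order
        index[key] = len(data)
        k, v = key.encode(), value.encode()
        data += struct.pack(">II", len(k), len(v)) + k + v
        bloom.add(key)
    return bytes(data), index, bloom


def read_key(data, index, bloom, key):
    if not bloom.might_contain(key):
        return None                 # definitely absent: skip index and data entirely
    offset = index.get(key)
    if offset is None:
        return None                 # Bloom filter false positive
    klen, vlen = struct.unpack_from(">II", data, offset)
    start = offset + 8 + klen
    return data[start:start + vlen].decode()
```

The point of the filter is that most lookups for absent keys stop at the first check, so the larger index and data files are only consulted when the key is probably present.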

answered Oct 23 '22 09:10 by Schildmeijer


Cassandra File Format in detail

Each ColumnFamily (e.g. one kind of object) is stored in separate SSTable files:

ColumnFamilyName-version-#-Data.db
ColumnFamilyName-version-#-Index.db
ColumnFamilyName-version-#-Filter.db
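
As a rough sketch, file names following this pattern can be parsed and grouped per SSTable. The regular expression is an assumption based only on the names shown on this page: it treats the version as an optional short lowercase tag (the value `hc` used in the test name is hypothetical) so that names without a version, such as `Standard1-1-Data.db`, also match, and it assumes column family names contain no hyphen.

```python
import re
from collections import defaultdict

# <column family>-[<version>-]<number>-<Data|Index|Filter>.db
SSTABLE_NAME = re.compile(
    r"^(?P<column_family>[^-]+)-"
    r"(?:(?P<version>[a-z]+)-)?"      # assumed format for the optional version tag
    r"(?P<generation>\d+)-"
    r"(?P<component>Data|Index|Filter)\.db$"
)


def group_sstables(file_names):
    """Map (column_family, generation) to the set of components found."""
    groups = defaultdict(set)
    for name in file_names:
        match = SSTABLE_NAME.match(name)
        if match is None:
            continue                  # not an SSTable file, e.g. a commit log
        key = (match["column_family"], int(match["generation"]))
        groups[key].add(match["component"])
    return dict(groups)
```

Grouping a directory listing this way makes it easy to see which Data, Index and Filter files belong to the same SSTable.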

answered Oct 23 '22 11:10 by leef