When experimenting with Cassandra I've observed that Cassandra writes to the following files:
/.../cassandra/commitlog/CommitLog-<id>.log
/.../cassandra/data/Keyspace1/Standard1-1-Data.db
/.../cassandra/data/Keyspace1/Standard1-1-Filter.db
/.../cassandra/data/Keyspace1/Standard1-1-Index.db
/.../cassandra/data/system/LocationInfo-1-Data.db
/.../cassandra/data/system/LocationInfo-1-Filter.db
/.../cassandra/data/system/LocationInfo-1-Index.db
/.../cassandra/data/system/LocationInfo-2-Data.db
/.../cassandra/data/system/LocationInfo-2-Filter.db
/.../cassandra/data/system/LocationInfo-2-Index.db
/.../cassandra/data/system/LocationInfo-3-Data.db
/.../cassandra/data/system/LocationInfo-3-Filter.db
/.../cassandra/data/system/LocationInfo-3-Index.db
/.../cassandra/system.log
The general structure seems to be:
/.../cassandra/commitlog/CommitLog-ID.log
/.../cassandra/data/KEYSPACE/COLUMN_FAMILY-N-Data.db
/.../cassandra/data/KEYSPACE/COLUMN_FAMILY-N-Filter.db
/.../cassandra/data/KEYSPACE/COLUMN_FAMILY-N-Index.db
/.../cassandra/system.log
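To make the naming pattern above concrete, here is a minimal sketch that splits a data file path into its parts. The regex is derived only from the paths listed above; the field name `generation` for N and the helper `parse_sstable_path` are my own labels, not anything Cassandra exposes.

```python
import re
from typing import NamedTuple, Optional

# Matches the layout observed above: data/KEYSPACE/COLUMN_FAMILY-N-COMPONENT.db
SSTABLE_PATH = re.compile(
    r"/data/(?P<keyspace>[^/]+)/"
    r"(?P<column_family>[^/]+)-(?P<generation>\d+)-"
    r"(?P<component>Data|Filter|Index)\.db$"
)


class SSTableFile(NamedTuple):
    keyspace: str
    column_family: str
    generation: int  # the N in COLUMN_FAMILY-N-Data.db (hypothetical label)
    component: str   # "Data", "Filter" or "Index"


def parse_sstable_path(path: str) -> Optional[SSTableFile]:
    """Return the parts of a data file path, or None if it does not match."""
    m = SSTABLE_PATH.search(path)
    if m is None:
        return None
    return SSTableFile(
        m["keyspace"], m["column_family"], int(m["generation"]), m["component"]
    )


print(parse_sstable_path("/.../cassandra/data/Keyspace1/Standard1-1-Data.db"))
# SSTableFile(keyspace='Keyspace1', column_family='Standard1', generation=1, component='Data')
```

Paths such as the commit log or system.log do not match the pattern and return None.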
What is the Cassandra file structure? More specifically, how are the data and commitlog directories used, and what is the structure of the files in the data directory (Data/Filter/Index)?
Commit Log: Whenever Cassandra handles a write, the data is written to both the memtable and the commit log. The main purpose of the commit log is to rebuild the memtable if a node crashes. The commit log is a flat file stored on disk.
Cassandra uses a storage structure similar to a Log-Structured Merge Tree, unlike a typical relational database, which uses a B-Tree. Cassandra avoids reading before writing. Read-before-write, especially in a large distributed system, can cause high read latency and other problems.
To satisfy a read, Cassandra must combine results from the active memtable and potentially multiple SSTables.
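As a rough illustration of that merge, the sketch below looks a key up in a memtable and several SSTables and keeps the newest value. It assumes each store is a plain dict mapping key to a (timestamp, value) pair, which is a simplification I made up for the example; the function name `read_key` is hypothetical too.

```python
from typing import Dict, List, Optional, Tuple

# Toy store: key -> (write timestamp, value). An assumption for this sketch,
# not Cassandra's actual on-disk or in-memory format.
Store = Dict[str, Tuple[int, str]]


def read_key(key: str, memtable: Store, sstables: List[Store]) -> Optional[str]:
    """Collect every version of `key` and return the one written last."""
    versions = [store[key] for store in [memtable, *sstables] if key in store]
    if not versions:
        return None
    # The version with the highest timestamp wins.
    return max(versions, key=lambda version: version[0])[1]


memtable = {"k1": (5, "new")}
sstables = [{"k1": (1, "old"), "k2": (2, "b")}, {"k2": (3, "b2")}]

print(read_key("k1", memtable, sstables))  # new
print(read_key("k2", memtable, sstables))  # b2
```

The more SSTables hold a version of the key, the more places a read has to consult, which is why reads can touch several files.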
In Cassandra, the data itself is automatically distributed, with (positive) performance consequences. It accomplishes this using partitions. Each node owns a particular set of tokens, and Cassandra distributes data based on the ranges of these tokens across the cluster.
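A toy version of token-based ownership might look like the following. It assumes a node owns the range ending at each of its tokens and that tokens wrap around like a ring; the 0..99 token space, the MD5-based `token_for`, and the `ToyRing` class are all invented for illustration and are not Cassandra's real partitioner.

```python
import bisect
import hashlib
from typing import Dict, List


class ToyRing:
    """Maps tokens to owning nodes; each node may own several tokens."""

    def __init__(self, node_tokens: Dict[str, List[int]]):
        pairs = sorted(
            (token, node) for node, tokens in node_tokens.items() for token in tokens
        )
        self._tokens = [token for token, _ in pairs]
        self._nodes = [node for _, node in pairs]

    @staticmethod
    def token_for(key: str) -> int:
        # Illustrative hash into a tiny 0..99 token space (an assumption).
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % 100

    def owner_of_token(self, token: int) -> str:
        # First node token >= the key's token owns it; past the last, wrap around.
        i = bisect.bisect_left(self._tokens, token)
        return self._nodes[i % len(self._nodes)]

    def owner_of_key(self, key: str) -> str:
        return self.owner_of_token(self.token_for(key))


ring = ToyRing({"n1": [25], "n2": [50], "n3": [75]})
print(ring.owner_of_token(10))  # n1
print(ring.owner_of_token(26))  # n2
print(ring.owner_of_token(80))  # n1 (wraps around the ring)
```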
A write to a Cassandra node first hits the commit log, which is written sequentially. Cassandra then stores the values in column-family-specific, in-memory data structures called memtables. The memtables are flushed to disk whenever one of the configurable thresholds is exceeded: (1) the size of the data in the memtable, (2) the number of objects reaches a certain limit, or (3) the lifetime of the memtable expires.
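The write path described above can be sketched with a toy in-memory model. This is not how Cassandra is implemented: the `ToyNode` class, its single object-count threshold, and the use of Python lists and dicts in place of files are assumptions made to keep the example short.

```python
class ToyNode:
    """Toy write path: commit log first, then memtable, flush on a threshold."""

    def __init__(self, max_memtable_entries: int = 3):
        self.commit_log = []   # stands in for the sequential on-disk log
        self.memtable = {}     # in-memory structure (one per column family in Cassandra)
        self.sstables = []     # flushed tables, oldest first
        self.max_memtable_entries = max_memtable_entries  # hypothetical threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. append to the commit log
        self.memtable[key] = value            # 2. update the memtable
        if len(self.memtable) >= self.max_memtable_entries:
            self.flush()                      # 3. flush once the threshold is hit

    def flush(self):
        # Write the memtable out as a table sorted by key, then start fresh.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed writes no longer need to be replayed

    def recover(self):
        """Rebuild the memtable by replaying the commit log after a crash."""
        self.memtable = {}
        for key, value in self.commit_log:
            self.memtable[key] = value


node = ToyNode(max_memtable_entries=3)
node.write("a", 1)
node.write("b", 2)
node.memtable = {}   # simulate losing memory in a crash
node.recover()
print(node.memtable)  # {'a': 1, 'b': 2}
```

Only the object-count threshold is modelled here; the size and lifetime thresholds would trigger the same flush step.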
The data folder contains a subfolder for each keyspace. Each subfolder contains three kinds of files: Data, Index, and Filter.
Cassandra File Format in detail
Each ColumnFamily (e.g. object) is stored in separate SSTable files:
ColumnFamilyName-version-#-Data.db
ColumnFamilyName-version-#-Index.db
ColumnFamilyName-version-#-Filter.db