I recently started working with the Cassandra database. I have installed a single-node cluster on my local box, and I am working with Cassandra 1.2.3.
I was reading an article on the internet and I came across this passage:
Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. A write is successful once it is written to the commit log and memory, so there is very minimal disk I/O at the time of write. Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable (sorted string table).
To understand the above lines, I wrote a simple program that writes to the Cassandra database using the Pelops client, and I was able to insert data into the database.
Now I am trying to see how my data was written into the commit log and where that commit log file is. I also want to know how SSTables are generated, where I can find them on my local box, and what they contain. I wanted to see these two kinds of files so that I can better understand how Cassandra works behind the scenes.
In my cassandra.yaml file, I have something like this:
# directories where Cassandra should store data on disk.
data_file_directories:
- S:\Apache Cassandra\apache-cassandra-1.2.3\storage\data
# commit log
commitlog_directory: S:\Apache Cassandra\apache-cassandra-1.2.3\storage\commitlog
# saved caches
saved_caches_directory: S:\Apache Cassandra\apache-cassandra-1.2.3\storage\savedcaches
But when I opened the commit log, first of all it has a lot of data, so my Notepad++ is not able to open it properly, and when it does open I cannot read it because of the encoding. And in my data folder I cannot find anything, meaning this folder is empty for me:
S:\Apache Cassandra\apache-cassandra-1.2.3\storage\data\my_keyspace\users
Is there anything I am missing here? Can anybody explain to me how to read the commit log and SSTable files, where I can find these two kinds of files, and what exactly happens behind the scenes whenever I write to the Cassandra database?
Update:
Code I am using to insert into the Cassandra database:
// Imports for the scale7 Pelops client and the Cassandra Thrift classes
import java.util.Date;
import java.util.List;

import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.log4j.Logger;
import org.scale7.cassandra.pelops.Cluster;
import org.scale7.cassandra.pelops.Mutator;
import org.scale7.cassandra.pelops.Pelops;
import org.scale7.cassandra.pelops.Selector;

public class MyPelops {

    private static final Logger log = Logger.getLogger(MyPelops.class);

    public static void main(String[] args) throws Exception {

        // -------------------------------------------------------------
        // -- Nodes, Pool, Keyspace, Column Family ---------------------
        // -------------------------------------------------------------

        // A comma-separated list of nodes
        String NODES = "localhost";

        // Thrift connection pool
        String THRIFT_CONNECTION_POOL = "Test Cluster";

        // Keyspace
        String KEYSPACE = "my_keyspace";

        // Column family
        String COLUMN_FAMILY = "users";

        // -------------------------------------------------------------
        // -- Cluster --------------------------------------------------
        // -------------------------------------------------------------

        Cluster cluster = new Cluster(NODES, 9160);
        Pelops.addPool(THRIFT_CONNECTION_POOL, cluster, KEYSPACE);

        // -------------------------------------------------------------
        // -- Mutator --------------------------------------------------
        // -------------------------------------------------------------

        Mutator mutator = Pelops.createMutator(THRIFT_CONNECTION_POOL);

        log.info("- Write Column -");

        mutator.writeColumn(
                COLUMN_FAMILY,
                "Row1",
                new Column().setName(" Name ".getBytes()).setValue(" Test One ".getBytes()).setTimestamp(new Date().getTime()));

        mutator.writeColumn(
                COLUMN_FAMILY,
                "Row1",
                new Column().setName(" Work ".getBytes()).setValue(" Engineer ".getBytes()).setTimestamp(new Date().getTime()));

        log.info("- Execute -");
        mutator.execute(ConsistencyLevel.ONE);

        // -------------------------------------------------------------
        // -- Selector -------------------------------------------------
        // -------------------------------------------------------------

        Selector selector = Pelops.createSelector(THRIFT_CONNECTION_POOL);

        int columnCount = selector.getColumnCount(COLUMN_FAMILY, "Row1",
                ConsistencyLevel.ONE);
        System.out.println("- Column Count = " + columnCount);

        List<Column> columnList = selector.getColumnsFromRow(COLUMN_FAMILY, "Row1",
                Selector.newColumnsPredicateAll(true, 10),
                ConsistencyLevel.ONE);
        System.out.println("- Size of Column List = " + columnList.size());

        for (Column column : columnList) {
            System.out.println("- Column: (" + new String(column.getName()) + ","
                    + new String(column.getValue()) + ")");
        }

        System.out.println("- All Done. Exit -");
        System.exit(0);
    }
}
Keyspace and column family that I have created:
create keyspace my_keyspace with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1};
use my_keyspace;
create column family users with column_type = 'Standard' and comparator = 'UTF8Type';
Commit logs are an append-only log of all mutations local to a Cassandra node. Any data written to Cassandra will first be written to a commit log before being written to a memtable. This provides durability in the case of an unexpected shutdown. On startup, any mutations in the commit log will be applied.
Memtable: an in-memory cache that stores the in-memory copy of the data. Each node has a memtable for each table. The memtable accumulates writes and serves reads for data that has not yet been flushed to disk. SSTable: the final destination of data in C*. SSTables are actual files on disk and are immutable.
The SSTables are files stored on disk. The naming convention for SSTable files has changed with Cassandra 2.2 and later to shorten the file path. The data files are stored in a data directory that varies with installation. For each keyspace, a directory within the data directory stores each table.
Sorted Strings Table (SSTable) is a persistent file format used by ScyllaDB, Apache Cassandra, and other NoSQL databases to take the in-memory data stored in memtables, order it for fast access, and store it on disk in a persistent, ordered, immutable set of files.
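To make the memtable/SSTable idea concrete, here is a toy sketch in plain Java (this is not Cassandra code, just an illustration of the concept under those definitions): writes accumulate in an in-memory map sorted by key, and a "flush" writes the entries out in key order to a file that is never modified afterwards.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class ToyMemtable {

    // In-memory and sorted by key: the rough equivalent of a memtable.
    private final TreeMap<String, String> memtable = new TreeMap<String, String>();
    private int generation = 0;

    public void write(String key, String value) {
        memtable.put(key, value); // writes only touch memory
    }

    // "Flush": dump the sorted entries to an ordered, write-once file
    // (the rough equivalent of an SSTable) and start a fresh memtable.
    public Path flush() throws IOException {
        Path sstable = Paths.get("toy-sstable-" + (++generation) + ".txt");
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : memtable.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        Files.write(sstable, sb.toString().getBytes(StandardCharsets.UTF_8));
        memtable.clear();
        return sstable;
    }

    public static void main(String[] args) throws IOException {
        ToyMemtable t = new ToyMemtable();
        t.write("Row1:Name", "Test One");
        t.write("Row1:Work", "Engineer");
        System.out.println("Flushed to " + t.flush());
    }
}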
You are almost there in your understanding; however, you are missing a few minor details.
So, explaining things in a structured way, the Cassandra write operation life cycle is divided into these steps:
1. Cassandra writes are first written to the commit log (for durability), and then to an in-memory table structure called a memtable.
2. A write is said to be successful once it is written to the commit log and to memory, so there is very minimal disk I/O at the time of the write.
3. Whenever the memtable runs out of space, that is, when it breaches its configured thresholds (based on size, number of operations, or elapsed time), it is written out to an SSTable, an immutable structure on disk. This mechanism is called flushing.
4. Once writes have been flushed to an SSTable, you can see the corresponding data files in the data folder, in your case S:\Apache Cassandra\apache-cassandra-1.2.3\storage\data, as the listing sketch below shows.
Each SSTable is mainly composed of two files:
- Index file, containing the Bloom filter and key-offset pairs
- Data file, containing the actual column data
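Here is the listing sketch mentioned above: a minimal bit of plain JDK I/O (the class name is my own; the path is the data_file_directories entry from your cassandra.yaml plus the keyspace and column family sub-directories) that prints whatever SSTable component files currently exist for the users column family.

import java.io.File;

public class ListUserSSTables {
    public static void main(String[] args) {
        // data_file_directories from the question's cassandra.yaml,
        // plus the per-keyspace and per-column-family sub-directories
        File tableDir = new File(
                "S:\\Apache Cassandra\\apache-cassandra-1.2.3\\storage\\data\\my_keyspace\\users");

        File[] files = tableDir.listFiles();
        if (files == null || files.length == 0) {
            System.out.println("No SSTable files yet - the memtable has not been flushed.");
            return;
        }
        // Each flushed SSTable shows up as a set of component files,
        // for example ...-Data.db, ...-Index.db and ...-Filter.db.
        for (File f : files) {
            System.out.println(f.getName() + "  (" + f.length() + " bytes)");
        }
    }
}

The directory stays empty until a flush happens; rather than waiting for the thresholds to be breached, you can force one on a running node with nodetool flush my_keyspace users, after which the component files should show up there.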
And regarding the commit log files: these are binary files maintained internally by Cassandra and are not meant to be read as text, which is why you cannot see anything meaningful when you open them in an editor.
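If you still want to look inside a commit log segment, a small sketch like the one below just prints its first bytes in hex, which makes it obvious that the format is binary rather than text. The segment file name used here is only a placeholder; substitute whatever file you actually see in your commitlog_directory.

import java.io.FileInputStream;
import java.io.IOException;

public class PeekCommitLog {
    public static void main(String[] args) throws IOException {
        // Placeholder: replace with an actual segment file from your commitlog_directory
        String segment = "S:\\Apache Cassandra\\apache-cassandra-1.2.3\\storage\\commitlog\\CommitLog-<id>.log";

        FileInputStream in = new FileInputStream(segment);
        try {
            byte[] buf = new byte[64];
            int n = in.read(buf);
            // Print the first bytes of the segment in hex
            for (int i = 0; i < n; i++) {
                System.out.printf("%02x ", buf[i] & 0xff);
            }
            System.out.println();
        } finally {
            in.close();
        }
    }
}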
UPDATE:
A memtable is an in-memory cache whose content is stored as key/column pairs (the data is sorted by key). Each column family has a separate memtable, and column data is retrieved from it by key. So I hope it is now clear why we cannot locate memtables on disk.
In your case, your memtable is not full, as the memtable thresholds have not been breached yet, so no flushing has happened. You can learn more about MemtableThresholds here, though it is recommended not to touch that dial.
For more information on the SSTable structure, refer to sstable.
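As a practical way to read what an SSTable's Data file actually contains on this Cassandra line, the bundled sstable2json tool dumps an SSTable to JSON: point it at the *-Data.db component file of a flushed SSTable (for example, one of the files printed by the listing sketch above, as in sstable2json <path-to->-Data.db), and it shows the row keys, column names, values and timestamps in readable form.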