Conceptual difference concerning column families in Cassandra's data model compared to Bigtable?

I am currently trying to dig into Cassandra's data model and its relation to Bigtable, but ended up with a strong headache concerning the Column Family concept.

Mainly my question was asked and already answered. However, I'm not satisfied with the answers :)

First, I read the Bigtable paper, especially the part on its data model, i.e. how data is stored. As far as I understand, each table in Bigtable is basically a sparse, multi-dimensional map with the dimensions row, column and time. The map is sorted by row key. Columns can be grouped into a column family using the naming convention family:qualifier. Therefore, a single row can contain multiple column families (see the example figure in the paper).
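As a rough illustration of this reading of the paper, the sketch below models one table as a plain Python dict. The function names (`put`, `families_in_row`, `latest`) and the cell values are made up for this example; only the row key and column names are loosely modeled on the paper's web-table figure. It is a toy model, not how Bigtable is implemented.

```python
# Toy model of a single Bigtable-style table, for illustration only.
# The table is a sparse map: (row key, "family:qualifier", timestamp) -> value.
table = {}

def put(row, column, timestamp, value):
    """Store one cell version."""
    table[(row, column, timestamp)] = value

def families_in_row(row):
    """Column families that have at least one cell in this row."""
    return {column.split(":", 1)[0] for (r, column, _ts) in table if r == row}

def latest(row, column):
    """Most recent value for (row, column), or None if the cell is empty."""
    versions = [(ts, value) for (r, c, ts), value in table.items()
                if r == row and c == column]
    return max(versions, key=lambda v: v[0])[1] if versions else None

# Sample cells (values are placeholders invented for this sketch).
put("com.cnn.www", "contents:", 3, "<html>v1")
put("com.cnn.www", "contents:", 6, "<html>v2")
put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
put("com.cnn.www", "anchor:my.look.ca", 8, "CNN.com")

# Rows are kept sorted by key; sorting the distinct row keys mimics that here.
row_keys = sorted({r for (r, _c, _ts) in table})
```

In this model, `families_in_row("com.cnn.www")` yields both `contents` and `anchor`, which is the sense in which one row spans several column families.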

Although it is stated that Cassandra relies on Bigtable's data model, I have read multiple times that in Cassandra a column family contains multiple rows and is to some extent comparable to a table in relational data stores. Isn't this contrary to Bigtable's approach, where a row can contain multiple column families? What comes first, the column family or the row :)? Are these concepts even comparable?

asked Nov 05 '17 11:11

OxideNt

1 Answer

The answer you linked to was from 6 years ago, and a lot has changed in Cassandra since. When Cassandra started out, its data model was indeed based on BigTable's. A row of data could include any number of columns, each of these columns has a name and a value. A row could have a thousand different columns, and a different row could have a thousand other columns - rows do not have to have the same columns. Such a database is called "schema-less", because there is no schema that each row needs to adhere to.

But Toto, we're not in Kansas any more - and Cassandra's model has changed in focus (though not in essence) since, and I'll try to explain how and why:

As Cassandra matured, its developers started to realize that schema-less isn't as great as they once thought it was. Schemas are valuable in ensuring application correctness. Moreover, one doesn't normally get to 1000 columns in a single row just because there are 1000 individually-named fields in one record. Rather, the more common case is that the record actually contains 200 entries, each with 5 fields. The schema should fix these 5 fields that every one of these entries should have, and what defines each of these separate entries is called a "clustering key". So around the time of Cassandra 0.8, six years ago, these ideas were introduced to Cassandra as "CQL" (the Cassandra Query Language).

For example, in CQL one declares that a column-family (which was dutifully renamed "table") has a schema, with a known list of fields:

CREATE TABLE groups (
    groupname text,
    username text,
    email text,
    age int,
    PRIMARY KEY (groupname, username)
);

This schema says that each wide row in the table (in modern Cassandra, this was renamed a "partition") with the key "groupname" is a possibly long list of users, each with username, email and age fields. The first name in the "PRIMARY KEY" specifier is the partition key (it determines the key of the wide rows), and the second is called the clustering key (it determines the key of the small rows that together make up the wide rows).
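The roles of the two key parts can be sketched in a few lines of Python. This is only a toy model under stated assumptions: `build_partitions` is a hypothetical helper, the e-mail addresses are placeholders, and sorting by username stands in for the way rows inside a partition are ordered by their clustering key.

```python
from collections import defaultdict

def build_partitions(rows):
    """Group (groupname, username, email, age) tuples into partitions.

    The partition key (groupname) selects the wide row; the clustering
    key (username) identifies and orders the small rows inside it.
    """
    partitions = defaultdict(dict)
    for groupname, username, email, age in rows:
        partitions[groupname][username] = {"email": email, "age": age}
    # Keep the small rows of each partition ordered by clustering key.
    return {pk: dict(sorted(small_rows.items()))
            for pk, small_rows in partitions.items()}

# Placeholder data, invented for this sketch.
rows = [
    ("mygroup", "john", "john@example.com", 27),
    ("mygroup", "joe", "joe@example.com", 38),
    ("othergroup", "ann", "ann@example.com", 41),
]
partitions = build_partitions(rows)
```

Here `partitions["mygroup"]` holds two small rows keyed by username, while `partitions["othergroup"]` is a separate wide row with one.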

Despite the new CQL dressup, Cassandra continued to implement these new concepts using the good-old-BigTable-wide-row-without-schema implementation. For example, consider that our data has a group "mygroup" with two people, (john, [email protected], 27) and (joe, [email protected], 38). Cassandra adds the following four column names->values to the wide row:

john:email -> [email protected]
john:age -> 27
joe:email -> [email protected]
joe:age -> 38

Note how we ended up with a wide row with 4 columns - 2 non-key fields per row (email and age), multiplied by the number of rows in the partition (2). The clustering key field "username" no longer appears anywhere as a value, but rather as part of the column's name! So if we have two username values "john" and "joe", we have some columns prefixed "john" and some columns prefixed "joe", and when we read the column "joe:email" we know this is the value of the email field of the row which has username=joe.
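The flattening described above can be mimicked with a short Python sketch. `to_wide_row` and `read_field` are hypothetical helpers, the e-mail addresses are placeholders, and the "prefix:field" string is just the notation used in this answer, not the actual binary encoding of composite column names.

```python
def to_wide_row(cql_rows, partition_key, clustering_key):
    """Flatten the CQL rows of one partition into composite-named columns."""
    wide_row = {}
    for row in cql_rows:
        prefix = row[clustering_key]
        for field, value in row.items():
            if field in (partition_key, clustering_key):
                continue  # key fields are not stored as column values
            wide_row[f"{prefix}:{field}"] = value
    return wide_row

def read_field(wide_row, clustering_value, field):
    """Look up one field of the small row identified by its clustering key."""
    return wide_row[f"{clustering_value}:{field}"]

# The two users of "mygroup" (e-mail addresses are placeholders).
cql_rows = [
    {"groupname": "mygroup", "username": "john",
     "email": "john@example.com", "age": 27},
    {"groupname": "mygroup", "username": "joe",
     "email": "joe@example.com", "age": 38},
]
wide_row = to_wide_row(cql_rows, "groupname", "username")
```

The result has four entries, two per user, and `read_field(wide_row, "joe", "email")` recovers the email of the row with username=joe.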

Cassandra still has this internal duality - converting the user-facing CQL rows and clustering keys into old-style wide rows. Until recently, Cassandra's on-disk format, known as "SSTables", was still schema-less and used composite names as shown above for column names. I wrote a detailed description of the SSTable format on Scylla's site https://github.com/scylladb/scylla/wiki/SSTables-Data-File (Scylla is a more efficient C++ re-implementation of Cassandra to which I contribute). However, storing column names this way is very inefficient, so Cassandra recently (in version 3.0) switched to a different file format which, for the first time, treats clustering keys and schema-full rows as first-class citizens. This was the last nail in the coffin of the schema-less Cassandra from 7 years ago. Cassandra is now schema-full, all the way.

answered Sep 22 '22 15:09

Nadav Har'El

Tags:

nosql

cassandra

google-cloud-bigtable

scylla
