Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Conceptual difference concerning column families in Cassandras data model compared to Bigtable?

I am currently trying to dig into Cassandra's data model and its relation to Bigtable, but ended up with a strong headache concerning the Column Family concept.

Mainly my question was asked and already answered. However, I'm not satisfied with the answers :)

Firstly I've read the Bigtable paper especially concerning its data model, i.e. how data is stored. As far as I understood each table in Bigtable basically relies on a multi-dimensional sparse map with the dimensions row, column and time. The map is sorted by rows. Columns can be grouped with the name convention family:qualifier to a column family. Therefore, a single row can contain multiple column families (see the example figure in the paper).

Although it is stated that Cassandra relies on Bigtable data model, I read multiple times that in Cassandra a column family contains multiple rows and is to some extent comparable to a table in relational data stores. Isn't this contrary to Bigtable's approach, where a row could contain multiple column families? What comes first, the column family or row :)? Are these concepts even comparable?

like image 909
OxideNt Avatar asked Nov 05 '17 11:11

OxideNt


People also ask

What is column family in Bigtable?

Column families are defined when the table is first created. Within a column family, one may have one or more named columns. All data within a column family is usually of the same type. The implementation of BigTable usually compresses all the columns within a column family together.

Is Bigtable a column family database?

Bigtable storage model Each row is indexed by a single row key, and columns that are related to one another are typically grouped into a column family. Each column is identified by a combination of the column family and a column qualifier, which is a unique name within the column family.

What are the column family attributes in Cassandra?

Cassandra Column Family Types Each column in a Cassandra column family database has three main values: key or column name, value, and an uneditable time stamp that reflects when the column was updated.

What is the difference between a column and a super column in a column family database?

A column family can contain super columns or columns. A super column is a dictionary, it is a column that contains other columns (but not other super columns). A column is a tuple of name, value and timestamp (I'll ignore the timestamp and treat it as a key/value pair from now on).


1 Answers

The answer you linked to was from 6 years ago, and a lot has changed in Cassandra since. When Cassandra started out, its data model was indeed based on BigTable's. A row of data could include any number of columns, each of these columns has a name and a value. A row could have a thousand different columns, and a different row could have a thousand other columns - rows do not have to have the same columns. Such a database is called "schema-less", because there is no schema that each row needs to adhere to.

But Toto, we're not in Kansas any more - and Cassandra's model changed in focus (though not in essense) since, and I'll try to explain how and why:

As Cassandra matured, its developers started to realize that schema-less isn't as great as they once thought it was. Schemas are valuable in ensuring application correctness. Moreover, one doesn't normally get to 1000 columns in a single row just because there are 1000 individually-named fields in one record. Rather, the more common case is that the record actually contains 200 entries, each with 5 fields. The schema should fix these 5 fields that every one of these entries should have, and what defines each of these separate entries is called a "clustering key". So around the time of Cassandra 0.8, six years ago, these ideas where introduced to Cassandra as the "CQL" (Cassandra Query Language).

For example, in CQL one declares that a column-family (which was dutifully renamed "table") has a schema, with a known list of fields:

CREATE TABLE groups (
    groupname text,
    username text,
    email text,
    age int,
    PRIMARY KEY (groupname, username)
)

This schema says that each wide row in the table (now, in modern Cassandra, this was renamed a "partition") with the key "groupname" is a a possibly long list of users, each with username, email and age fields. The first name in the "PRIMARY KEY" specifier is the partition key (it determines the key of the wide rows), and the second is called the clustering key (it determines the key of the small rows that together make up the wide rows).

Despite the new CQL dressup, Cassandra continued to implement these new concepts using the good-old-BigTable-wide-row-without-schema implementation. For example, consider that our data has a group "mygroup" with two people, (john, [email protected], 27) and (joe, [email protected], 38). Cassandra adds the following four column names->values to the wide row:

john:email -> [email protected]
john:age -> 27
joe:email -> [email protected]
joe:age -> 27

Note how we ended up with a wide row with 4 columns - 2 non-key fields per row (email and age), multiplied by the number of rows in the partition (2). The clustering key field "username" no longer appears anywhere as the value, but rather as part of the column's name! So If we have two username values "john" and "joe", We have some columns prefixed "john" and some columns prefixed "joe", and when we read the column "joe:email" we know this is the value of the email field of the row which has username=joe.

Cassandra still has this internal duality - converting the user-facing CQL rows and clustering keys into old-style wide rows. Until recently, Cassandra's on-disk format known as "SSTables" was still schema-less and used composite names as shown above for column names. I wrote a detailed description of the SSTable format on Scylla's site https://github.com/scylladb/scylla/wiki/SSTables-Data-File (Scylla is a more efficient C++ re-implementation of Cassandra to which I contribute). However, column names are very inefficient in this format so Cassandra recently (in version 3.0) switched to a different file format, which for the first time, accepts clustering keys and schema-full rows as first class citizens. This was the last nail in the coffin of the schema-less Cassandra from 7 years ago. Cassandra is now schema-full, all the way.

like image 87
Nadav Har'El Avatar answered Sep 22 '22 15:09

Nadav Har'El