Why are super columns in Cassandra no longer favoured?

I have read in the latest release notes that super columns are not desirable due to "performance issues", but nowhere is this explained.

Then I read articles such as this one that give wonderful indexing patterns using super columns.

This leaves me with no idea of what is currently the best way to do indexing in Cassandra.

  1. What are the performance issues of super columns?
  2. Where can I find current best practices for indexing?
IamIC asked Aug 11 '12 13:08


1 Answer

Super columns suffer from a number of problems, not least of which is that Cassandra has to deserialize all of the sub-columns of a super column when querying (even if the result will only return a small subset). As a result, there is a practical limit to the number of sub-columns per super column that can be stored before performance suffers.
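To make the cost concrete, here is a toy model in Python. It is not Cassandra's actual on-disk format: JSON stands in for the serialized super column, and `read_subcolumn` and the `col…` names are made up for illustration. The only point is that fetching a single sub-column still means decoding all of them.

```python
import json


def read_subcolumn(blob: bytes, name: str):
    """Toy super column read: the whole blob is decoded to fetch one entry.

    Returns the requested value and how many sub-columns had to be decoded.
    """
    subcolumns = json.loads(blob)  # decodes every sub-column, wanted or not
    return subcolumns[name], len(subcolumns)


# A hypothetical super column holding 10,000 sub-columns as one serialized unit.
super_column = json.dumps(
    {f"col{i:05d}": f"value{i}" for i in range(10_000)}
).encode()

value, decoded = read_subcolumn(super_column, "col00042")
print(value, decoded)
```

The work done grows with the size of the super column rather than with the size of the result, which is where the practical limit on sub-columns comes from.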

In theory, this could be fixed within Cassandra by properly indexing sub-columns, but consensus is that composite columns are a better solution, and they work without the added complexity.

The easiest way to make use of composite columns is to take advantage of the abstraction that CQL 3 provides. Consider the following schema:

    CREATE TABLE messages (
        username text,
        sent_at timestamp,
        message text,
        sender text,
        PRIMARY KEY (username, sent_at)
    );

Here username is the row key, but the PRIMARY KEY definition groups the row key with the sent_at column. This matters because sent_at becomes part of each column's name, and since the columns within a row are stored in sorted order, that attribute is effectively indexed.

    INSERT INTO messages (username, sent_at, message, sender) VALUES ('bob', '2012-08-01 11:42:15', 'Hi', 'alice');
    INSERT INTO messages (username, sent_at, message, sender) VALUES ('alice', '2012-08-01 11:42:37', 'Hi yourself', 'bob');
    INSERT INTO messages (username, sent_at, message, sender) VALUES ('bob', '2012-08-01 11:43:00', 'What are you doing later?', 'alice');
    INSERT INTO messages (username, sent_at, message, sender) VALUES ('bob', '2012-08-01 11:47:14', 'Bob?', 'alice');

Behind the scenes Cassandra will store the above inserted data something like this:

    alice: (2012-08-01 11:42:37,message): Hi yourself, (2012-08-01 11:42:37,sender): bob
    bob:   (2012-08-01 11:42:15,message): Hi, (2012-08-01 11:42:15,sender): alice,
           (2012-08-01 11:43:00,message): What are you doing later?, (2012-08-01 11:43:00,sender): alice,
           (2012-08-01 11:47:14,message): Bob?, (2012-08-01 11:47:14,sender): alice
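If it helps to see the mapping spelled out, the sketch below flattens the CQL rows into that layout using plain Python dictionaries. It is only an illustration: `to_storage_rows` is a made-up helper, tuples stand in for composite column names, and timestamps are kept as strings (which happen to sort chronologically in this format).

```python
from collections import defaultdict


def to_storage_rows(cql_rows, partition_key, clustering_key):
    """Flatten CQL rows into {row key: {(clustering value, column name): value}}."""
    storage = defaultdict(dict)
    for row in cql_rows:
        for name, value in row.items():
            if name in (partition_key, clustering_key):
                continue  # key columns become part of the row key / column name
            composite_name = (row[clustering_key], name)
            storage[row[partition_key]][composite_name] = value
    # Columns within a storage row are kept sorted by composite name.
    return {key: dict(sorted(cols.items())) for key, cols in storage.items()}


messages = [
    {"username": "bob", "sent_at": "2012-08-01 11:42:15", "message": "Hi", "sender": "alice"},
    {"username": "alice", "sent_at": "2012-08-01 11:42:37", "message": "Hi yourself", "sender": "bob"},
    {"username": "bob", "sent_at": "2012-08-01 11:43:00", "message": "What are you doing later?", "sender": "alice"},
    {"username": "bob", "sent_at": "2012-08-01 11:47:14", "message": "Bob?", "sender": "alice"},
]

layout = to_storage_rows(messages, "username", "sent_at")
for row_key, columns in layout.items():
    print(row_key, columns)
```

Each CQL row contributes one composite column per non-key field, so bob's three messages end up as six columns in a single storage row.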

But using CQL 3, we can query the "row" using a sent_at predicate, and get back a tabular result set.

    SELECT * FROM messages WHERE username = 'bob' AND sent_at > '2012-08-01';

     username | sent_at                  | message                   | sender
    ----------+--------------------------+---------------------------+--------
          bob | 2012-08-01 11:42:15+0000 |                        Hi |  alice
          bob | 2012-08-01 11:43:00+0000 | What are you doing later? |  alice
          bob | 2012-08-01 11:47:14+0000 |                      Bob? |  alice
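The reason a sent_at predicate is cheap is that the columns are already sorted by it, so the matching range can be found by binary search and read as one contiguous slice. Below is a rough Python sketch of that idea, not Cassandra's actual read path: `slice_after` is a hypothetical helper, tuples stand in for composite column names, and timestamps are compared as strings.

```python
from bisect import bisect_right

# Sorted composite columns for the "bob" storage row: ((sent_at, field), value).
bob_columns = [
    (("2012-08-01 11:42:15", "message"), "Hi"),
    (("2012-08-01 11:42:15", "sender"), "alice"),
    (("2012-08-01 11:43:00", "message"), "What are you doing later?"),
    (("2012-08-01 11:43:00", "sender"), "alice"),
    (("2012-08-01 11:47:14", "message"), "Bob?"),
    (("2012-08-01 11:47:14", "sender"), "alice"),
]


def slice_after(columns, username, sent_after):
    """Return tabular rows whose sent_at is strictly greater than sent_after."""
    names = [name for name, _ in columns]
    # "\uffff" sorts after any field name used here, so the search lands just
    # past every column sharing the sent_after timestamp.
    start = bisect_right(names, (sent_after, "\uffff"))
    rows = {}
    for (sent_at, field), value in columns[start:]:
        row = rows.setdefault(sent_at, {"username": username, "sent_at": sent_at})
        row[field] = value
    return list(rows.values())


result = slice_after(bob_columns, "bob", "2012-08-01 11:42:15")
for row in result:
    print(row)
```

With a cut-off of 11:42:15 the slice starts at the third column and regroups the remaining four columns into two tabular rows, without touching the columns that sort before the cut-off.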
jericevans answered Sep 27 '22 17:09