
What is the difference between a clustering column and a secondary index in Cassandra?


I'm trying to understand the difference between these two and the scenarios in which you would prefer to use one over the other.

My specific use case is using Cassandra as an event ingestion system backed by an analytics engine that interprets the events.

My model includes

  • event id (the partition key)
  • event time (a clustering column)
  • event type (I'm not sure whether to use a clustering column or a secondary index)

I figure the most common read scenario will be to get the events over a time range, hence event time is the clustering column. A less frequent read scenario might involve further filtering the event query by event type.

Josh asked Jul 08 '14 01:07




1 Answer

A secondary index is pretty similar to what we know from regular relational databases. If a query has a WHERE clause on columns that are not part of the primary key, the lookup is slow because every row has to be scanned (Cassandra actually rejects such a query unless you add ALLOW FILTERING). Secondary indexes make it possible to service such queries efficiently. They are stored as extra tables that hold just enough data to find the matching rows in the main table.

So that's a good ol' index, which we already know about. So far, there's nothing specific to Cassandra and its distributed nature.

Partitioning and clustering are about how rows of the main table are spread among the nodes and ordered on each node. This part is specific to Cassandra, since it determines the distribution of data. The primary key consists of at least one column. The first column in the primary key is used as the partition key, which decides which node stores a row. If the primary key has additional columns, they are used to cluster the data on a given node - within each partition, rows are stored on the node in sorted order by the clustering columns.
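As a rough sketch of how the asker's model could map onto a table definition: the table name, column names, column types, and the literal values below are all assumptions made for illustration, not something given in the question.

```cql
-- Hypothetical schema: names and types are illustrative assumptions.
CREATE TABLE IF NOT EXISTS events (
    event_id   uuid,       -- partition key: decides which node stores the row
    event_time timestamp,  -- clustering column: sort order inside the partition
    event_type text,       -- regular (non-key) column
    PRIMARY KEY ((event_id), event_time)
) WITH CLUSTERING ORDER BY (event_time ASC);

-- A range restriction on the clustering column reads a contiguous slice
-- of one partition, so the partition key must be given as well.
SELECT * FROM events
WHERE event_id = 123e4567-e89b-12d3-a456-426614174000
  AND event_time >= '2014-07-01 00:00:00+0000'
  AND event_time <  '2014-07-08 00:00:00+0000';
```

Note that the time range here applies within a single partition, that is, for one given event_id.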

This question has more specifics on clustering columns: Clustering Keys in Cassandra

So an index on a given column X makes the lookup X --> primary key efficient. The partition key (first column in the primary key) determines which node a row is stored on. Clustering columns (additional columns in the primary key) determine the order in which rows are stored on their assigned node.

So your intuition sounds about right - the event ID is presumably guaranteed unique, so it is a good basis for a primary key. Event time is a great way to order rows on disk on a given node.

If you never needed to look up data by event type, e.g., never had a query like SELECT * FROM Events WHERE Type = 'Warning', then you would have no need for additional indexes, but your demands for partitioning wouldn't change. Indexes make it easy to serve queries with different predicates. Since you mentioned that you are indeed planning on performing queries like that, you likely do want an index on your EventType column.
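A minimal sketch of what such an index might look like in CQL. The table, the column names, and the index name events_by_type are hypothetical, chosen only to mirror the model in the question.

```cql
-- Hypothetical table; names and types are illustrative assumptions.
CREATE TABLE IF NOT EXISTS events (
    event_id   uuid,
    event_time timestamp,
    event_type text,
    PRIMARY KEY ((event_id), event_time)
);

-- Secondary index on the non-key column (index name is an assumption).
CREATE INDEX IF NOT EXISTS events_by_type ON events (event_type);

-- With the index in place, this predicate on a non-key column is accepted
-- without ALLOW FILTERING.
SELECT * FROM events WHERE event_type = 'Warning';
```

The index only adds a lookup path from event_type to the matching rows. The partition key and clustering column still decide where the rows live and how they are ordered.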

Check out the Cassandra documentation: http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_compound_keys_c.html

Cassandra uses the first column name in the primary key definition as the partition key.
...
In the case of the playlists table, the song_order is the clustering column. The data for each partition is clustered by the remaining column or columns of the primary key definition. On a physical node, when rows for a partition key are stored in order based on the clustering columns

antiduh answered Sep 22 '22 07:09