
Why do many refer to Cassandra as a column-oriented database?

Reading several papers and documents on the internet, I found a lot of contradictory information about the Cassandra data model. Many identify it as a column-oriented database, others as row-oriented, and still others define it as a hybrid of both.

According to what I know about how Cassandra stores files, it uses the *-Index.db file to find the right position in the *-Data.db file, where the bloom filter, the column index, and then the columns of the required row are stored.

In my opinion, this is strictly row-oriented. Is there something I'm missing?

cesare asked Oct 22 '12 11:10




2 Answers

  • If you take a look at the README file in the Apache Cassandra git repo, it says:

Cassandra is a partitioned row store. Rows are organized into tables with a required primary key.

Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent matter. Cassandra will automatically repartition as machines are added and removed from the cluster.

Row store means that like relational databases, Cassandra organizes data by rows and columns.

  • Column-oriented (columnar) databases store data on disk column by column.

    e.g. the Bonuses table:

      ID    Last    First   Bonus
      1     Doe     John    8000
      2     Smith   Jane    4000
      3     Beck    Sam     1000

  • In a row-oriented database management system, the data would be stored like this:
    1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;

  • In a column-oriented database management system, the data would be stored like this:
    1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;

  • Cassandra is basically a column-family store.

  • Cassandra would store the above data as:

     "Bonuses" : {
           row1 : { "ID":1, "Last":"Doe", "First":"John", "Bonus":8000 },
           row2 : { "ID":2, "Last":"Smith", "First":"Jane", "Bonus":4000 },
           ...
     }

  • Also, the number of columns in each row doesn't have to be the same. One row can have 100 columns and the next row can have only 1 column.

  • Read this for more details.
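
The three layouts described above can be compared in a short Python sketch. It only illustrates the ordering of values, not Cassandra's actual on-disk format, and the function names (`row_oriented`, `column_oriented`, `column_family`) are hypothetical helpers made up for this example.

```python
# Illustrative sketch only: models the ordering of values, not a real storage engine.
# The data is the Bonuses table from the answer; all function names are hypothetical.
header = ["ID", "Last", "First", "Bonus"]
rows = [
    (1, "Doe", "John", 8000),
    (2, "Smith", "Jane", 4000),
    (3, "Beck", "Sam", 1000),
]


def row_oriented(rows):
    """Serialize one complete row after another."""
    return "".join(",".join(str(v) for v in row) + ";" for row in rows)


def column_oriented(rows):
    """Serialize all values of one column, then all values of the next."""
    columns = zip(*rows)  # transpose rows into columns
    return "".join(",".join(str(v) for v in col) + ";" for col in columns)


def column_family(header, rows):
    """Group values by row key; every row carries its own column names."""
    return {f"row{i}": dict(zip(header, row)) for i, row in enumerate(rows, start=1)}


row_layout = row_oriented(rows)
column_layout = column_oriented(rows)
family = column_family(header, rows)

print(row_layout)
print(column_layout)
print(family["row2"])
```

In the column-family layout the values of one row stay together, as in the row-oriented layout; the difference is that each row names its own columns, which is why rows can differ in which columns they hold.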

tharindu_DG answered Oct 18 '22 01:10


Yes, the "column-oriented" terminology is a bit confusing.

The model in Cassandra is that rows contain columns. To access the smallest unit of data (a column) you have to specify first the row name (key), then the column name.

So in a column family called Fruit you could have a structure like the following example (with 2 rows), where the fruit types are the row keys, and the columns each have a name and value.

apple  -> colour    weight  price  variety
          "red"     100     40     "Cox"

orange -> colour    weight  price  origin
          "orange"  120     50     "Spain"

One difference from a table-based relational database is that one can omit columns (orange has no variety), or add arbitrary columns (orange has origin) at any time. You can still imagine the data above as a table, albeit a sparse one where many values might be empty.
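
As a rough illustration of this "rows contain columns" model, the Fruit example can be sketched as a dictionary of dictionaries in Python. This is an analogy for the logical model only, not Cassandra's API; `get_column` and `as_sparse_table` are hypothetical helper names.

```python
# Logical-model sketch only: a column family as a map of row key -> {column name: value}.
# get_column and as_sparse_table are hypothetical helpers, not Cassandra calls.
fruit = {
    "apple": {"colour": "red", "weight": 100, "price": 40, "variety": "Cox"},
    "orange": {"colour": "orange", "weight": 120, "price": 50, "origin": "Spain"},
}


def get_column(column_family, row_key, column_name):
    """Reach a single column: row key first, then column name."""
    return column_family[row_key].get(column_name)


def as_sparse_table(column_family):
    """View the rows as a table over the union of column names; gaps become None."""
    all_columns = sorted({name for row in column_family.values() for name in row})
    table = {
        key: [row.get(name) for name in all_columns]
        for key, row in column_family.items()
    }
    return all_columns, table


columns, table = as_sparse_table(fruit)
print(get_column(fruit, "apple", "variety"))
print(columns)
print(table["orange"])
```

The `None` entries in the table view correspond to the empty cells of the sparse table: apple has no origin and orange has no variety.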

However, a "column-oriented" model can also be used for lists and time series, where every column name is unique (and here we have just one row, but we could have thousands or millions of columns):

temperature ->  2012-09-01  2012-09-02  2012-09-03  ...
                40          41          39          ...

which is quite different from a relational model, where one would have to model the entries of a time series as rows not columns. This type of usage is often referred to as "wide rows".
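
A wide row can be sketched in Python as a single row whose columns are kept sorted by name, so that a range of columns (here, a date range) can be read as one slice. The `WideRow` class and its methods are hypothetical names for this sketch, which assumes column names that sort correctly as strings (ISO dates do); it is not how Cassandra implements the feature.

```python
from bisect import bisect_left, bisect_right


class WideRow:
    """Hypothetical sketch of one row whose columns stay sorted by column name."""

    def __init__(self):
        self._names = []   # column names, kept in sorted order
        self._values = {}  # column name -> value

    def insert(self, name, value):
        """Add or overwrite a column, keeping the names sorted."""
        if name not in self._values:
            self._names.insert(bisect_left(self._names, name), name)
        self._values[name] = value

    def slice(self, start, end):
        """Return the columns whose names fall in [start, end], in order."""
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, end)
        return [(name, self._values[name]) for name in self._names[lo:hi]]


# The temperature row from the example; insertion order does not matter.
temperature = WideRow()
for day, reading in [("2012-09-03", 39), ("2012-09-01", 40), ("2012-09-02", 41)]:
    temperature.insert(day, reading)

recent = temperature.slice("2012-09-02", "2012-09-03")
print(recent)
```

Each new reading becomes a new column in the same row rather than a new row, which is the inversion of the relational layout that the answer describes.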

DNA answered Oct 18 '22 02:10