
Ways to implement data versioning in Cassandra

Can you share your thoughts on how you would implement data versioning in Cassandra?

Suppose that I need to version records in a simple address book. (Address book records are stored as Rows in a ColumnFamily.) I expect that the history:

  • will be used infrequently
  • will be used all at once to present it in a "time machine" fashion
  • won't contain more than a few hundred versions of a single record
  • won't expire.

I'm considering the following approaches:

  • Convert the address book to a Super Column Family and store multiple versions of each address book record in one Row, as super columns keyed by time stamp.

  • Create a new Super Column Family to store old records or changes to the records. Such a structure would look as follows:

    {
        'address book row key': {
            'time stamp1': {
                'first name': 'new name',
                'modified by': 'user id',
            },
            'time stamp2': {
                'first name': 'new name',
                'modified by': 'user id',
            },
        },
        'another address book row key': {
            'time stamp': { ....

  • Store versions as serialized (JSON) objects in a new ColumnFamily, representing each set of versions as a row and each version as a column. (Modelled after Simple Document Versioning with CouchDB.)
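
For illustration, the third option could be modelled roughly as below. This is only a sketch that uses a plain Python dict in place of a real ColumnFamily; `history_cf`, `save_version`, `load_history`, and the `contact_42` key are hypothetical names, not part of any Cassandra client API.

```python
import json

# Hypothetical in-memory stand-in for a ColumnFamily:
# row key -> {column name (time stamp) -> JSON string}
history_cf = {}

def save_version(record_key, timestamp, record, modified_by):
    """Store one full snapshot of a record as JSON in a time-stamp-named column."""
    payload = json.dumps({"data": record, "modified_by": modified_by}, sort_keys=True)
    history_cf.setdefault(record_key, {})[timestamp] = payload

def load_history(record_key):
    """Return all versions of a record, oldest first, as (timestamp, snapshot) pairs."""
    row = history_cf.get(record_key, {})
    return [(ts, json.loads(row[ts])) for ts in sorted(row)]

# Two versions of the same (made-up) address book record.
save_version("contact_42", 1290635938721704, {"first name": "Ann"}, "user_1")
save_version("contact_42", 1290636018401680, {"first name": "Anna"}, "user_2")
history = load_history("contact_42")
```

Reading the whole row at once matches the "time machine" use case, since every version of a record comes back in a single lookup.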

Piotr Czapla asked Nov 15 '10 11:11


2 Answers

If you can add the assumption that address books typically have fewer than 10,000 entries in them, then using one row per address book time line in a super column family would be a decent approach.

A row would look like:

{'address_book_18f3a8': {
  1290635938721704: {'entry1': 'entry1_stuff', 'entry2': 'entry2_stuff'},
  1290636018401680: {'entry1': 'entry1_stuff_v2', ...},
  ...
}}

where the row key identifies the address book, each super column name is a time stamp, and the subcolumns represent the address book's contents for that version.

This would allow you to read the latest version of an address book with only one query and also write a new version with a single insert.

The reason I suggest this only if address books have fewer than 10,000 entries is that a super column must be completely deserialized when you read even a single subcolumn. Overall, that's not too bad in this case, but it's something to keep in mind.
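
As a rough sketch of this layout, here is an in-memory model built from nested Python dicts. It is not Cassandra client code; `timeline_scf`, `write_version`, and `read_latest` are hypothetical names used only for illustration.

```python
# Hypothetical in-memory model of a super column family:
# row key -> {super column name (time stamp) -> {subcolumn name -> value}}
timeline_scf = {}

def write_version(book_key, timestamp, entries):
    """Write a complete snapshot of the address book as one super column."""
    timeline_scf.setdefault(book_key, {})[timestamp] = dict(entries)

def read_latest(book_key):
    """Return (timestamp, entries) for the newest version using one row lookup."""
    row = timeline_scf[book_key]
    latest = max(row)  # newest version has the largest time stamp
    return latest, row[latest]

write_version("address_book_18f3a8", 1290635938721704,
              {"entry1": "entry1_stuff", "entry2": "entry2_stuff"})
write_version("address_book_18f3a8", 1290636018401680,
              {"entry1": "entry1_stuff_v2", "entry2": "entry2_stuff"})
latest_ts, latest_entries = read_latest("address_book_18f3a8")
```

In this model every version is a full copy of the address book, which is why the size of a single version matters: reading any one entry means loading the whole snapshot it belongs to.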

An alternative approach would be to use a single row per version of the address book, and use a separate CF with a time line row per address book like:

{'address_book_18f3a8': {1290635938721704: some_uuid1, 1290636018401680: some_uuid2...}}

Here, some_uuid1 and some_uuid2 correspond to the row key for those versions of the address book. The downside to this approach is that it requires two queries every time the address book is read. The upside is that it lets you efficiently read only select parts of an address book.
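
A minimal sketch of this two-CF layout, again using plain Python dicts rather than a real Cassandra client. The names `timeline_cf`, `versions_cf`, `write_version`, and `read_entries` are assumptions made for the example.

```python
import uuid

# Hypothetical in-memory stand-ins for the two column families.
timeline_cf = {}   # address book key -> {time stamp -> version row key}
versions_cf = {}   # version row key -> {entry name -> value}

def write_version(book_key, timestamp, entries):
    """Store a version in its own row and record its key in the time line row."""
    version_key = str(uuid.uuid4())
    versions_cf[version_key] = dict(entries)
    timeline_cf.setdefault(book_key, {})[timestamp] = version_key
    return version_key

def read_entries(book_key, entry_names, timestamp=None):
    """Read selected entries from one version (the latest by default)."""
    timeline = timeline_cf[book_key]                         # lookup 1: time line row
    ts = max(timeline) if timestamp is None else timestamp
    version_row = versions_cf[timeline[ts]]                  # lookup 2: version row
    return {name: version_row[name] for name in entry_names if name in version_row}

write_version("address_book_18f3a8", 1290635938721704,
              {"entry1": "entry1_stuff", "entry2": "entry2_stuff"})
write_version("address_book_18f3a8", 1290636018401680,
              {"entry1": "entry1_stuff_v2", "entry2": "entry2_stuff"})
latest_entry1 = read_entries("address_book_18f3a8", ["entry1"])
first_entry1 = read_entries("address_book_18f3a8", ["entry1"], 1290635938721704)
```

The two dictionary lookups in `read_entries` stand in for the two queries mentioned above; the second one fetches only the requested entries instead of the whole version.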

Tyler Hobbs answered Sep 22 '22 16:09


HBase (http://hbase.apache.org/) has this functionality built in. Give it a try.

azi answered Sep 24 '22 16:09