Data Versioning in Cassandra with CQL3

Tags:

I am quite a n00b in Cassandra (I'm mainly from an RDBMS background with some NoSQL here and there, like Google's BigTable and MongoDB) and I'm struggling with the data modelling for the use cases I'm trying to satisfy. I looked at this and this and even this but they're not exactly what I needed.

I have this basic table:

CREATE TABLE documents (
    itemid_version text,       
    xml_payload text,
    insert_time timestamp,
    PRIMARY KEY (itemid_version)
);

itemid is actually a UUID (and unique for all documents), and version is an int (version 0 is the "first" version). xml_payload is the full XML doc, and can get quite big. Yes, I'm essentially creating a versioned document store.

As you can see, I concatenated the two to create a primary key and I'll get to why I did this later as I explain the requirements and/or use cases:

user needs to get the single (1) doc he wants, he knows the item id and version (not necessarily the latest)
user needs to get the single (1) doc he wants, he knows the item id but does not know the latest version
user needs the version history of a single (1) doc.
user needs to get the list (1 or more) of docs he wants, he knows the item id AND version (not necessarily the latest)

I will be writing the client code that will perform the use cases, please excuse the syntax as I'm trying to be language-agnostic

first one's straightforward:

$itemid_version = concat($itemid, $version)
$doc = csql("select * from documents where itemid_version = {0};" 
    -f $itemid_version)

now to satisfy the 2nd and 3rd use cases, I am adding the following table:

CREATE TABLE document_versions (
    itemid uuid,
    version int,
    PRIMARY KEY (itemid, version)
) WITH clustering order by (version DESC);

new records will be added as new docs and new versions of existing docs are created

now we have this (use case #2):

$latest_itemid, $latest_version = csql("select itemid, 
    version from document_versions where item_id = {0} 
    order by version DESC limit 1;" -f $itemid)
$itemid_version = concat($latest_itemid, $latest_version)
$doc = csql("select * from documents where itemid_version = {0};" 
    -f $itemid_version)

and this (use case #3):

$versions = csql("select version from document_versions where item_id = {0}" 
    -f $itemid)

for the 3rd requirement, I am adding yet another table:

CREATE TABLE latest_documents (
    itemid uuid,
    version int,
    PRIMARY KEY (itemid, version)
)

records are inserted for new docs, records are updated for existing docs

and now we have this:

$latest_itemids, $latest_versions = csql("select itemid, version 
    from latest_documents where item_id in ({0})" -f $itemid_list.toCSV())

foreach ($one_itemid in $latest_itemids, $one_version in $latest_versions)
    $itemid_version = concat($latest_itemid, $latest_version)
    $latest_docs.append(
        cql("select * from documents where itemid_version = {0};" 
        -f $itemid_version))

Now I hope it's clear why I concatenated itemid and version to create an index for documents as opposed to creating a compound key: I cannot have OR in the WHERE clause in SELECT

You can assume that only one process will do the inserts/updates so you don't need to worry about consistency or isolation issues.

Am I on the right track here? there are quite a number of things that doesn't sit well with me...but mainly because I don't understand Cassandra yet:

I feel that the primary key for documents should be a composite of (itemid, version) but I can't satisfy use case #4 (return a list from a query)...I can't possibly use a separate SELECT statement for each document due to the performance hit (network overhead)...or can (should) I?
2 trips to get a document if the version is not known beforehand. probably a compromise I have to live with, or maybe there's a better way.

573

asked May 21 '14 14:05

Dexter Legaspi

2 Answers

How would this work Dexter?

It is actually very similar to your solution actually except you can store all versions and be able to fetch the 'latest' version just from one table (document_versions).

In most cases I think you can get what you want in a single SELECT except use case #2 where fetching the most recent version of a document where a pre SELECT is needed on document_versions first.

SECOND ATTEMPT

(I removed the code from the first attempt, apologies to anyone who was following in the comments).

CREATE TABLE documents (
        itemid_version text,
        xml_payload text,
        insert_time timestamp,
        PRIMARY KEY (itemid_version)
);

CREATE TABLE document_versions (
        itemid text,
        version int,
        PRIMARY KEY (itemid, version)
) WITH CLUSTERING ORDER BY (version DESC);


INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc1-1', '<?xml>1st</xml>', '2014-05-21 18:00:00');
INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc1-2', '<?xml>2nd</xml>', '2014-05-21 18:00:00');
INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc2-1', '<?xml>1st</xml>', '2014-05-21 18:00:00');
INSERT INTO documents (itemid_version, xml_payload, insert_time) VALUES ('doc2-2', '<?xml>2nd</xml>', '2014-05-21 18:00:00');

INSERT INTO document_versions (itemid, version) VALUES ('doc1', 1);
INSERT INTO document_versions (itemid, version) VALUES ('doc1', 2);
INSERT INTO document_versions (itemid, version) VALUES ('doc2', 1);
INSERT INTO document_versions (itemid, version) VALUES ('doc2', 2);

user needs to get the single (1) doc he wants, he knows the item id and version (not necessarily the latest)

SELECT * FROM documents WHERE itemid_version = 'doc1-2';
user needs to get the single (1) doc he wants, he knows the item id but does not know the latest version (You would feed concatenated itemid + version in result of first query into second query)

SELECT * FROM document_versions WHERE itemid = 'doc2' LIMIT 1;

SELECT * FROM documents WHERE itemid_version = 'doc2-2';
user needs the version history of a single (1) doc.

SELECT * FROM document_versions WHERE itemid = 'doc2';
user needs to get the list (1 or more) of docs he wants, he knows the item id AND version (not necessarily the latest)

SELECT * FROM documents WHERE itemid_version IN ('doc1-2', 'doc2-1');

Cheers,

124

answered Sep 21 '22 12:09

reggoodwin

Lets see if we can come up with a model in a top down fashion starting from your queries:

CREATE TABLE document_versions (
  itemid uuid,
  name text STATIC,
  vewrsion int,   
  xml_payload text,
  insert_time timestamp,
  PRIMARY KEY ((itemid), version)
) WITH CLUSTERING ORDER BY (version DESC);

Use case 1: user needs to get the single (1) doc he wants, he knows the item id and version (not necessarily the latest)
```
SELECT * FROM document_versions 
  WHERE itemid = ? and version = ?;
```
Use case 2: user needs to get the single (1) doc he wants, he knows the item id but does not know the latest version
```
SELECT * FROM document_versions
  WHERE itemid = ? limit 1;
```
Use case 3: user needs the version history of a single (1) doc.
```
SELECT * FROM document_versions 
  WHERE itemid = ?
```
Use case 4: user needs to get the list (1 or more) of docs he wants, he knows the item id AND version (not necessarily the latest)
```
SELECT * FROM documents 
  WHERE itemid = 'doc1' and version IN ('1', '2');
```
One table for all these queries is the correct approach. I would suggest taking the Datastax free online course: DS220 Data Modeling

answered Sep 20 '22 12:09

Igor Zelaya

Related questions
                            
                                Cassandra - how to retrieve all keys in a CF (Random Partitioner)
                            
                                Keyspace schema import and export in Cassandra
                            
                                CQL 3 Cassandra 1.2 counter: how to insert primary key?
                            
                                Using Cassandra and CQL3, how do you insert an entire wide row in a single request?
                            
                                clojure-variable-names for database_column_names
                            
                                data auditing in Cassandra
                            
                                Apache Cassandra CQL queries explain plan
                            
                                How will i know that record was duplicate or it was inserted successfully?
                            
                                Cannot connect java client to cassandra with password authentication enabled
                            
                                Cassandra replication factor greater than number of nodes
                            
                                Why doesn't my Cassandra update work?
                            
                                Performance - TTL vs Deleting a row in Cassandra
                            
                                How can I add a new field in a Cassandra table with default value "whatever"?
                            
                                How default_time_to_live would delete rows without tombstones in Cassandra?
                            
                                Preventing Cassandra from dumping hprof files
                            
                                What are the implications of using a low GCGraceSeconds value in Cassandra?
                            
                                Cassandra (CQL) select statement with 'where' is not working
                            
                                Cassandra: what is the correct configuration for EC2 multi-region?
                            
                                cassandra primary key. performance implications if integer vs varchar
                            
                                Simple example of how to use the Cassandra database (cql) and node.js

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Data Versioning in Cassandra with CQL3

Tags:

cassandra

cassandra-2.0

Dexter Legaspi

People also ask

2 Answers

reggoodwin

Igor Zelaya

Recent Activity

Donate For Us