Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient modeling of versioned hierarchies in Cassandra

Disclaimer:
This is quite a long post. I first explain the data I am dealing with, and what I want to do with it.
Then I detail three possible solutions I have considered, because I've tried to do my homework (I swear :]). I end up with a "best guess" which is a variation of the first solution.

My ultimate question is: what's the most sensible way to solve my problem using Cassandra? Is it one of my attempts, or is it something else?
I am looking for advice/feedback from experienced Cassandra users...

My data:
I have many SuperDocuments that own Documents in a tree structure (headings, subheadings, sections, …).

Each SuperDocument structure can change (renaming of headings mostly) over time, thus giving me multiple versions of the structure as shown below.

superdocument versions

What I'm looking for:
For each SuperDocument I need to timestamp those structures by date as above and I'd like, for a given date, to find the closest earlier version of the SuperDocument structure. (ie. the most recent version for which version_date < given_date)

These considerations might help solving the problem more easily:

  • Versions are immutable: changes are rare enough, I can create a new representation of the whole structure each time it changes.
  • I do not need to access a subtree of the structure.
  • I'd say it is OK to say that I do not need to find all the ancestors of a given leaf, nor do I need to access a specific node/leaf inside the tree. I can work all of this out in my client code once I have the whole tree.

OK let's do it
Please keep in mind I am really just starting using Cassandra. I've read/watched a lot of resources about data modeling, but haven't got much (any!) experience in the field!
Which also means everything will be written in CQL3... sorry Thrift lovers!

My first attempt at solving this was to create the following table:

CREATE TABLE IF NOT EXISTS superdoc_structures (
    doc_id varchar,
    version_date timestamp,
    pre_pos int,
    post_pos int,
    title text,

    PRIMARY KEY ((doc_id, version_date), pre_pos, post_pos)

) WITH CLUSTERING ORDER BY (pre_pos ASC);

That would give me the following structure:

enter image description here

I'm using a Nested Sets model for my trees here; I figured it would work well to keep the structure ordered, but I am open to other suggestions.

I like this solution: each version has its own row, in which each column represents a level of the hierarchy.
The problem though is that I (candidly) intended to query my data as follows:

SELECT * FROM superdoc_structures 
    WHERE doc_id="3399c35...14e1" AND version_date < '2014-03-11' LIMIT 1

Cassandra quickly reminded me I was not allowed to do that! (because the partitioner does not preserve row order on the cluster nodes, so it is not possible to scan through partition keys)

What then...?
Well, because Cassandra won't let me use inequalities on partition keys, so be it!
I'll make version_date a clustering key and all my problems will be gone. Yeah, not really...

First try:

CREATE TABLE IF NOT EXISTS superdoc_structures (
    doc_id varchar,
    version_date timestamp,
    pre_pos int,
    post_pos int,
    title text,

    PRIMARY KEY (doc_id, version_date, pre_pos, post_pos)

) WITH CLUSTERING ORDER BY (version_date DESC, pre_pos ASC);

I find this one less elegant: all versions and structure levels are made into columns of a now very wide row (compared to my previous solution):

second modeling attempt

Problem: with the same request, using LIMIT 1 will only return the first heading. And using no LIMIT would return all versions structure levels, which I would have to filter to only keep the most recent ones.

Second try:

there's no second try yet... I have an idea though, but I feel it's not using Cassandra wisely.

The idea would be to cluster by version_date only, and somehow store whole hierarchies in each column values. Sounds bad doesn't it?

I would do something like this:

CREATE TABLE IF NOT EXISTS superdoc_structures (
    doc_id varchar,
    version_date timestamp,
    nested_sets map<int, int>,
    titles list<text>,

    PRIMARY KEY (doc_id, version_date)

) WITH CLUSTERING ORDER BY (version_date DESC);

The resulting row structure would then be:

third modeling attempt

It looks kind of all right to me in fact, but I will probably have more data than the level title to de-normalize into my columns. If it's only two attributes, I could go with another map (associating titles with ids for instance), but more data would lead to more lists, and I have the feeling it would quickly become an anti-pattern.
Plus, I'd have to merge all lists together in my client app when the data comes in!

ALTERNATIVE & BEST GUESS
After giving it some more thought, there's an "hybrid" solution that might work and may be efficient and elegant:

I could use another table that would list only the version dates of a SuperDocument & cache these dates into a Memcache instance (or Redis or whatever) for real quick access.
That would allow me to quickly find the version I need to fetch, and then request it using the composite key of my first solution.

That's two queries, plus a memory cache store to manage. But I may end up with one anyway, so maybe that'd be the best compromise?
Maybe I don't even need a cache store?

All in all, I really feel the first solution is the most elegant one to model my data. What about you?!

like image 218
Olivier Lance Avatar asked Aug 22 '14 14:08

Olivier Lance


People also ask

What is a Cassandra data model?

This article will look at the Cassandra Data Model, which is significantly different from what we see in RDBMS. In simple words, a Data model is the logical structure of a database. It describes how data is stored and accessed and the relationships among different types of data.

Why is it important to understand the core concepts of Cassandra?

Understanding the core concepts and identifying the query patterns in advance is necessary for designing a correct data model that gets the best performance from a Cassandra cluster.

Why are some query patterns not efficient in Cassandra?

Due to the way that Cassandra stores data, some query patterns are not at all efficient, including the following:

What is Cassandra partitioning?

Cassandra is a distributed database in which data is partitioned and stored across multiple nodes within a cluster. The partition key is made up of one or more data fields and is used by the partitioner to generate a token via hashing to distribute the data uniformly across a cluster.


1 Answers

First, you don't need to use memcache or redis. Cassandra will give you very fast access to that information. You could certainly have a table that was something like:

create table superdoc_structures {
    doc_id varchar;
    version_date timestamp;
    /* stuff */
    primary key (doc_id, version_date)
} with clustering order by (version_date desc);

which would give you a quick way to access a given version (this query may look familiar ;-):

select * from superdoc_structures 
    where doc_id="3399c35...14e1" and
        version_date < '2014-03-11'
    order by version_date desc
    limit 1;

Since nothing about the document tree structure seems to be relevant from the schema's point of view, and you are happy as a clam to create the document in its entirety every time there is a new version, I don't see why you'd even bother breaking out the tree in to separate rows. Why not just have the entire document in the table as a text or blob field?

create table superdoc_structures {
    doc_id varchar;
    version_date timestamp;
    contents text;
    primary key (doc_id, version_date)
} with clustering order by (version_date desc);

So to get the contents of the document as existed at the new year, you'd do:

select contents from superdoc_structures
where doc_id="...." and 
    version_date < '2014-01-1'
order by version_date > 1

Now, if you did want to maintain some kind of hierarchy of the document components, I'd recommend doing something like a closure table table to represent it. Alternatively, since you are willing to copy the entire document on each write anyway, why not copy the entire section info on each write, why not do so and have a schema like:

create table superdoc_structures {
    doc_id varchar;
    version_date timestamp;
    section_path varchar;
    contents text;
    primary key (doc_id, version_date, section_path)
) with clustering order by (version_date desc, section_path asc);

Then have section path have a syntax like, "first_level next_level sub_level leaf_name". As a side benefit, when you have the version_date of the document (or if you create a secondary index on section_path), because a space is lexically "lower" than any other valid character, you can actually grab a subsection very cleanly:

select section_path, contents from superdoc_structures
where doc_id = '....' and
    version_date = '2013-12-22' and
    section_path >= 'chapter4 subsection2' and
    section_path < 'chapter4 subsection2!';

Alternatively, you can store the sections using Cassandra's support for collections, but again... I'm not sure why you'd even bother breaking them out as doing them as one big chunk works just great.

like image 185
Christopher Smith Avatar answered Oct 18 '22 16:10

Christopher Smith