 

Storing a list of values in Cassandra

Tags:

cassandra

Version Dependent

Some of the answers to this question deal with older versions of Cassandra. The correct answer for this kind of problem depends on the version of Cassandra you are using.


I have a profile column family and want to store a list of skills in each profile. I'm not sure how this is typically accomplished in Cassandra. One option would be to store a serialized Thrift or protobuf structure, but I'd prefer not to, as I believe Cassandra has no knowledge of these formats, so the data in the datastore would not be human-readable or queryable via CQL from the command line. The other solution I thought of would be to use a super column and put each skill as a key with a null value:

skills: {
  'java': '',
  'c++': '',
  'cobol': ''
}

Is this a good way of handling lists in Cassandra? I imagine there's some idiom I'm not aware of. I'm using the Astyanax client library, which supports only composite columns rather than super columns, so the solution I proposed above seems quite awkward in that case. I'm also still having some trouble understanding composite columns, as they do not seem to be completely documented yet. Would this solution work with composite columns?

Ben McCann asked Mar 26 '12 14:03


3 Answers

This answer dates to before the release of Cassandra version 1.2, which provided substantially different functionality for handling lists. The answer might be inappropriate if you are using Cassandra 1.2+.


I would encode lists in the column key, using composite columns with the real column name as the first dimension, i.e.:

row_key -> {
     [column_name; entry1] -> "",
     [column_name; entry2] -> "",
     ... 
}

Then, to read the list, you would need to do a get_slice from [column_name; ] to [column_name; ] - note the empty dimensions.

The great thing about this is that it actually implements a set quite nicely: the list cannot contain the same thing twice. I think this works in your use case. The list would also be maintained in sorted order.
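
To make the set-like behaviour concrete, here is a minimal pure-Python sketch of the layout described above. It is only a toy model and does not talk to Cassandra: the `Row` class and its `add` and `slice` methods are hypothetical names invented for illustration, and `slice` merely imitates what a get_slice over the `[column_name; ]` range would return.

```python
from bisect import bisect_left


class Row:
    """Toy model of one wide row whose column names are
    (column_name, entry) composites kept in sorted order."""

    def __init__(self):
        self._columns = []  # sorted list of (column_name, entry) tuples

    def add(self, column_name, entry):
        """Insert a composite column; writing the same composite twice
        is a no-op, which is what makes the 'list' behave like a set."""
        key = (column_name, entry)
        i = bisect_left(self._columns, key)
        if i == len(self._columns) or self._columns[i] != key:
            self._columns.insert(i, key)

    def slice(self, column_name):
        """Return every entry stored under column_name, in sorted order.
        The empty second component plays the role of the empty dimension
        that marks the start of the range."""
        start = bisect_left(self._columns, (column_name, ""))
        entries = []
        for name, entry in self._columns[start:]:
            if name != column_name:
                break
            entries.append(entry)
        return entries


row = Row()
for skill in ["java", "c++", "cobol", "java"]:  # "java" is written twice
    row.add("skills", skill)
row.add("languages", "english")  # a second, unrelated list in the same row

print(row.slice("skills"))  # ['c++', 'cobol', 'java']
```

The duplicate "java" collapses into a single column, and the entries come back sorted rather than in insertion order, so this layout suits sets better than ordered lists.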

tom.wilkie answered Nov 15 '22 12:11


This answer dates to before the release of Cassandra version 1.2, which provided substantially different functionality for handling lists. The answer might be inappropriate if you are using Cassandra 1.2+.


As mentioned on the mailing list, my preference, which has worked very well for me, is to store a single column "skills" whose value is a serialized JSON string.

It really comes down to the usage patterns you have for "skills".

  • If "skills" are just for CRUD on a per user basis, this is fine.
  • If you want to be able to search for all users that have a skill of "cobol", then I would still recommend this approach, plus another row keyed skill:cobol that has a column per user UUID with a value of a timestamp or something similar ...
  • I'm sure that with Pig/Hadoop integration on your Cassandra nodes, you could also still quite happily query all of the users that have x, y and z to generate new data to support additional use cases.
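
The first two options above can be sketched in plain Python as follows. This is an illustrative model only: the dictionaries stand in for column families, and `profiles`, `skill_index`, `save_skills`, `load_skills` and `users_with_skill` are hypothetical names, not part of any Cassandra client API.

```python
import json
import time

# Stand-ins for two column families (assumed names, for illustration only).
profiles = {}     # row key (user id) -> {column name: value}
skill_index = {}  # row key "skill:<name>" -> {user id: timestamp}


def save_skills(user_id, skills):
    """Store the whole list as one JSON string in the "skills" column and
    add the user to one index row per skill."""
    profiles.setdefault(user_id, {})["skills"] = json.dumps(skills)
    now = time.time()
    for skill in skills:
        skill_index.setdefault("skill:" + skill, {})[user_id] = now


def load_skills(user_id):
    """Read the column back and deserialize it (per-user CRUD)."""
    return json.loads(profiles[user_id]["skills"])


def users_with_skill(skill):
    """Look up the index row to find every user that has the skill."""
    return sorted(skill_index.get("skill:" + skill, {}))


save_skills("user-1", ["java", "cobol"])
save_skills("user-2", ["cobol"])

print(load_skills("user-1"))      # ['java', 'cobol']
print(users_with_skill("cobol"))  # ['user-1', 'user-2']
```

This sketch never removes index entries, so a real implementation would also have to delete a user's column from the skill:<name> row when that skill is dropped from the profile.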

sdolgy answered Nov 15 '22 12:11


In older versions of Cassandra, you had to serialize the list yourself and store it in a column, or perhaps use a super column.

Since version 1.2 of Cassandra, CQL3 has collection types for columns, so you can give list<text> as the type of a column in your schema. For example:

 CREATE TABLE Person (
    name text,
    skills list<text>,
    PRIMARY KEY (name)
 );

Or you could use set<text> if you want to automatically eliminate duplicates.

Raedwald answered Nov 15 '22 10:11