Cassandra for a schemaless db, 10's of millions order tables and millions of queries per day

Question

I am building a database, with the following characteristics:

Schemaless database with a variable number of columns for each row.
Tens of millions of records and tens of columns.
Millions of queries per day.
Thousands of writes per day.
Queries will be filtering on several columns (not only the key).

I am considering Cassandra, which is built to scale.

My questions are:

Do I need to scale horizontally in this case?
Does Cassandra support having several keys to point to the same column-family?

EDIT

I would like to make sure that I got your point right. So, the following example puts down what I got from your answer:

So, if we have the following column family (it holds some store products and their details)

products // column-family name
{
x = {   "id":"x", // this is unique id for the row. 
    "name":"Laptop",
    "screen":"15 inch",
    "OS":"Windows"}
y = {   "id":"y", // this is unique id for the row. 
    "name":"Laptop",
    "screen":"17 inch"}
z = {   "id":"z", // this is unique id for the row. 
    "name":"Printer",
    "page per minute":"20 pages"}
}

And if we want to add "name" as a search parameter, we create another copy of the CF with different row keys, as follows:

products
{
"x:name:Laptop"  = {    "id":"x", 
            "name":"Laptop",
            "screen":"15 inch",
            "OS":"Windows"}
"y:name:Laptop"  = {    "id":"y", 
            "name":"Laptop",
            "screen":"17 inch"}
"z:name:Printer" = {    "id":"z", 
            "name":"Printer",
            "ppm":"20 pages"}
}

And similarly, in order to add the "screen" search parameter:

products
{
"x:screen:15 inch" = {  "id":"x", 
            "name":"Laptop",
            "screen":"15 inch",
            "OS":"Windows"}
"y:screen:17 inch" = {  "id":"y", 
            "name":"Laptop",
            "screen":"17 inch"}
}

But if we want to query on 10 search parameters, or on any combination of them (as is the case in my application), then we would have to create 1023 copies of the column family [(2 to the power of 10) - 1]. And since most rows will have many of the search parameters, this means we need roughly 1000 times the storage to model the data this way, which is a lot, especially with 10,000,000 rows in the original CF.
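The count in the paragraph above can be checked with a short Python sketch. The parameter names p1 to p10 are placeholders, not part of the real application; the sketch only enumerates the non-empty combinations of 10 search parameters.

```python
from itertools import combinations

# Hypothetical names standing in for the 10 search parameters.
params = [f"p{i}" for i in range(1, 11)]

# Under a "one copy per combination" scheme, every non-empty subset
# of the parameters would need its own copy of the data.
subsets = [
    combo
    for size in range(1, len(params) + 1)
    for combo in combinations(params, size)
]

print(len(subsets))          # number of non-empty combinations
print(2 ** len(params) - 1)  # closed-form count for comparison
```

Both lines print the same number, which is where the figure of 1023 copies comes from.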

Is this the data model you suggested?

Another point: I don't see exactly why creating secondary indexes would forfeit the schemaless model.

le-doude · Accepted Answer

Cassandra is not a db you can query by anything other than the row key. But you can tailor your datamodel to support those queries.

We do 175,000,000 queries a day on our 6-node Cassandra cluster (easy!), but we only ask for data using row keys and column names, because we designed our data model to work that way. We do not use indexed queries.

To support richer queries we denormalize our data: the values we will use as search parameters become part of the keys used to retrieve the data.

Example: Consider we save the following object:

obj {
   id : xxx //assuming id is a unique id across the system
   p1 : value1
   p2 : value2
}

If we know we want to search by any of those parameters, we save a copy of obj under each of the following column names or keys:

"p1:value1:xxx"
"p2:value2:xxx"
"p1:value1:p2:value2:xxx" 
"xxx"

This way we can search for obj with p1 = value1, with p2 = value2, with p1 = value1 AND p2 = value2, or just by its unique id xxx.
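As a rough illustration of the key scheme above, the following Python sketch builds one lookup key per combination of search parameters, plus the bare id. The function name lookup_keys is made up for this example, and sorting the parameter names is an assumption added so that writers and readers always build the same key.

```python
from itertools import combinations

def lookup_keys(uid, params):
    """Build one key per non-empty combination of search parameters,
    followed by the bare unique id."""
    names = sorted(params)  # fixed order so reads and writes agree on the key
    keys = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            prefix = ":".join(f"{name}:{params[name]}" for name in combo)
            keys.append(f"{prefix}:{uid}")
    keys.append(uid)
    return keys

keys = lookup_keys("xxx", {"p1": "value1", "p2": "value2"})
for key in keys:
    print(key)
```

For the two-parameter obj this yields the same four keys listed above. In practice you would only generate the combinations you actually query on, not all of them.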

The only other option if you do not want to do that is to use Secondary indexes and indexed queries but that would forfeit the "schema-less" requirement of your question.

EDIT - An example.

We want to save objects "Products" defined as

class Products{
    string uid;
    string name;
    int screen_size; //in inches
    string os;
    string brand;
}

We serialize it into a string or byte array (I tend to use Jackson JSON or Protobuf; both work very well with Cassandra and are super fast). We put that byte array into a column.

Now the important part: creating the column names and the row keys. Let's say we want to search by screen size and possibly filter by brand. We define buckets for the screen size as ["0_to_15", "16_to_21", "22_up"]

Given the column value:

"{uid:"MI615FMDO548", name:"SFG-0098", screen_size:15, os:"Android JellyBean", brand:"Samsung"}"

one copy gets saved under each of:
- key = "brand:Samsung" and column_name = "screen_size:15_uid:MI615FMDO548"
- key = "screen_size:0_to_15" and column_name = "brand:Samsung_uid:MI615FMDO548"
- key = "uid:MI615FMDO548" and column_name = "product"

Why do I add the uid to the column name? To make all column names unique for unique products.
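A minimal Python sketch of this write path follows. It only computes the (row key, column name, value) triples and does not talk to Cassandra. The function names and the exact bucket boundaries are assumptions made for illustration, and json stands in for whatever serializer you actually use.

```python
import json

def screen_bucket(size):
    # Bucket boundaries are an assumption based on the labels in the answer.
    if size <= 15:
        return "0_to_15"
    if size <= 21:
        return "16_to_21"
    return "22_up"

def index_entries(product):
    """Return the (row_key, column_name, value) triples to write for one product."""
    payload = json.dumps(product)  # serialized copy stored in every column
    uid = product["uid"]
    brand = product["brand"]
    size = product["screen_size"]
    return [
        # Row per brand, columns ordered by screen size
        (f"brand:{brand}", f"screen_size:{size}_uid:{uid}", payload),
        # Row per screen-size bucket, columns ordered by brand
        (f"screen_size:{screen_bucket(size)}", f"brand:{brand}_uid:{uid}", payload),
        # Row per product, for direct lookup by uid
        (f"uid:{uid}", "product", payload),
    ]

entries = index_entries({
    "uid": "MI615FMDO548",
    "name": "SFG-0098",
    "screen_size": 15,
    "os": "Android JellyBean",
    "brand": "Samsung",
})
for row_key, column_name, _ in entries:
    print(row_key, "=>", column_name)
```

Every write of a product becomes three column writes, which is the storage cost you pay for being able to read by brand, by size bucket, or by uid.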

Example, part 2.

Now let's say we have added:

"{uid:"MI615FMDO548", name:"SFG-0098", screen_size:15, os:"Android JellyBean", brand:"Samsung"}"
"{uid:"MI615FMD5589", name:"SFG-0097", screen_size:14, os:"Android JellyBean", brand:"Samsung"}"
"{uid:"MI615FMD1111", name:"SFG-0098", screen_size:17, os:"Android JellyBean", brand:"Samsung"}"
"{uid:"MI615FMDO687", name:"SFG-0095", screen_size:13, os:"Android JellyBean", brand:"Samsung"}"

We will end up with the following column family:

Products{
-Row:"brand:Samsung"
=> "screen_size:13_uid:MI615FMDO687":"{uid:"MI615FMDO687", name:"SFG-0095", screen_size:13, os:"Android JellyBean", brand:"Samsung"}"
=> "screen_size:14_uid:MI615FMD5589":"{uid:"MI615FMD5589", name:"SFG-0097", screen_size:14, os:"Android JellyBean", brand:"Samsung"}"
=> "screen_size:15_uid:MI615FMDO548":"{uid:"MI615FMDO548", name:"SFG-0098", screen_size:15, os:"Android JellyBean", brand:"Samsung"}"
=> "screen_size:17_uid:MI615FMD1111":"{uid:"MI615FMD1111", name:"SFG-0098", screen_size:17, os:"Android JellyBean", brand:"Samsung"}"
-Row:"screen_size:0_to_15"
=> "brand:Samsung_uid:MI615FMDO687":"{uid:"MI615FMDO687", name:"SFG-0095", screen_size:13, os:"Android JellyBean", brand:"Samsung"}"
=> "brand:Samsung_uid:MI615FMD5589":"{uid:"MI615FMD5589", name:"SFG-0097", screen_size:14, os:"Android JellyBean", brand:"Samsung"}"
=> "brand:Samsung_uid:MI615FMDO548":"{uid:"MI615FMDO548", name:"SFG-0098", screen_size:15, os:"Android JellyBean", brand:"Samsung"}"
-Row:"screen_size:16_to_21"
=> "brand:Samsung_uid:MI615FMD1111":"{uid:"MI615FMD1111", name:"SFG-0098", screen_size:17, os:"Android JellyBean", brand:"Samsung"}"
-Row:"uid:MI615FMDO687"
=> "product":"{uid:"MI615FMDO687", name:"SFG-0095", screen_size:13, os:"Android JellyBean", brand:"Samsung"}"
-Row:"uid:MI615FMD5589"
=> "product":"{uid:"MI615FMD5589", name:"SFG-0097", screen_size:14, os:"Android JellyBean", brand:"Samsung"}"
-Row:"uid:MI615FMDO548"
=> "product":"{uid:"MI615FMDO548", name:"SFG-0098", screen_size:15, os:"Android JellyBean", brand:"Samsung"}"
-Row:"uid:MI615FMD1111"
=> "product":"{uid:"MI615FMD1111", name:"SFG-0098", screen_size:17, os:"Android JellyBean", brand:"Samsung"}"
}

Now by using range queries across column names you can search by brand and by screen size.
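To make the range-query idea concrete, here is a small in-memory Python sketch. It does not use a Cassandra client. It assumes the column names of a row are kept in sorted (lexicographic) order, as a string comparator would keep them, and shows how a start/end pair selects a slice of the "brand:Samsung" row. The helper name column_slice is hypothetical.

```python
from bisect import bisect_left

# Column names of the "brand:Samsung" row, kept in sorted order.
columns = sorted([
    "screen_size:13_uid:MI615FMDO687",
    "screen_size:14_uid:MI615FMD5589",
    "screen_size:15_uid:MI615FMDO548",
    "screen_size:17_uid:MI615FMD1111",
])

def column_slice(sorted_names, start, end):
    """Return the names with start <= name < end, like a column slice."""
    lo = bisect_left(sorted_names, start)
    hi = bisect_left(sorted_names, end)
    return sorted_names[lo:hi]

# Samsung products with a screen size of 14 or 15 inches.
hits = column_slice(columns, "screen_size:14", "screen_size:16")
for name in hits:
    print(name)
```

Because the comparison is on strings, this only orders sizes correctly while they all have the same number of digits; zero-padding the number (for example 07, 15) would be one way to keep the ordering right for mixed widths.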

Hope this was useful.

Tags:

cassandra

Ababneh A
