Best practice modeling data for Cassandra databases

I'm new to Cassandra and am looking for a best practice on how to model data that has this general following structure:

The data is "user" based (one data set per customer). Each user supplies a big data file of around 500K-2M entries, updated periodically a few times a day (sometimes a full update and sometimes only deltas).

Each data file has ~20 mandatory data fields, but users can add additional columns at their discretion (up to ~100).

The additional data fields are NOT necessarily the same across users (neither the names of the fields nor their types).

Example (CSV format):

user_id_1.csv

| column1 (unique key per user_id)  |  column2  |  column3 |   ...   |  column10  |  additionalColumn1  |  ...additionalColumn_n |
|-----------------------------------|-----------|----------|---------|------------|---------------------|------------------------|
| user_id_1_key_1                   |  value    |  value   |  value  |  value     |                ...  |  value                 |
| user_id_1_key_2                   |  ....     |  ....    |  ....   |  ....      |                ...  |  ...                   |
| ....                              |  ...      |  ...     |  ...    |  ...       |                ...  |  ...                   |
| user_id_1_key_2Million            |  ....     |  ....    |  ....   |  ....      |                ...  |  ...                   |


user_id_XXX.csv (notice that the first 10 columns are identical to the other users', but the additional columns differ in both name and type)

|             column1 (unique key per user_id)              |  column2  |  column3 |   ...   |  column10  |  additionalColumn1 (different types than user_id_1 and others)  |  ...additional_column_x |
|-----------------------------------------------------------|-----------|----------|---------|------------|-----------------------------------------------------------------|-------------------------|
| user_id_XXX_key_1                                         |  value    |  value   |  value  |  value     |                                                            ...  |  value                  |
| user_id_XXX_key_2                                         |  ....     |  ....    |  ....   |  ....      |                                                            ...  |  ...                    |
| ....                                                      |  ...      |  ...     |  ...    |  ...       |                                                            ...  |  ...                    |
| user_id_XXX_key_500_thousand (less rows than other user)  |  ....     |  ....    |  ....   |  ....      |                                                            ...  |  ...                    |

Several options I have considered:

Option 1:

  1. Create a "global" keyspace
  2. Create a big table "data" containing everything
  3. Add a user_id column alongside all of the other columns of the big table (including the non-mandatory columns). The primary key becomes (user_id, column_1) (column_1 is unique per user_id)

                                     Keyspace
    +--------------------------------------------------------------------------+
    |                                                                          |
    |                                                                          |
    |                                      Data_Table                          |
    |                +  +--------+-------+--------------------------+-----+    |
    |                |  |        |       |                          |     |    |
    |                |  +-------------------------------------------------+    |
    |                |  |        |       |                          |     |    |
    |    many rows   |  +-------------------------------------------------+    |
    |                |  |        |       |                          |     |    |
    |                |  |        |       |                          |     |    |
    |                |  |        |       |                          |     |    |
    |                |  |        |       |     Many columns         |     |    |
    |                |  |        |       +------------------------> |     |    |
    |                |  |        |       |                          |     |    |
    |                |  +-------------------------------------------------+    |
    |                v  +-------------------------------------------------+    |
    |                                                                          |
    +--------------------------------------------------------------------------+
    

A few things that I notice right away:

  1. The user_id repeats itself as many times as there are entries per user
  2. The rows are very sparse in the additional columns (many null values), since users don't necessarily share them
  3. The number of users is relatively small, so the number of additional columns is not huge (10K columns max)
  4. I could compact the additional-column data per user into one "metadata" column and share that scheme across all users
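
A minimal CQL sketch of Option 1, assuming the "metadata" compaction idea from note 4 (all identifiers are illustrative; a map<text, text> collapses the typed per-user extras into strings, which loses type information):

```sql
CREATE TABLE global_ks.data (
    user_id bigint,
    column_1 text,    -- unique per user_id
    column_2 text,
    -- ... remaining mandatory columns ...
    column_10 text,
    additional_columns map<text, text>,  -- sparse per-user extras as key/value pairs
    PRIMARY KEY ( (user_id, column_1) )
);
```

This avoids the thousands of mostly-null static columns, at the cost of serializing the additional values to text.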

Option 2:

Create Keyspace per User_id

Create table "data" per keyspace

+-----------------------------------------------------------------------------------+
| column_1 | column_2 | ... | column_n | additional_column_1 | additional_column_n  |
+-----------------------------------------------------------------------------------+

keyspace_user1         keyspace_user2                     keyspace_user_n
+----------------+    +---------------+                  +---------------+
|                |    |               |                  |               |
|                |    |               |                  |               |
|   +-+-+--+-+   |    |    +-+--+--+  |                  |   +--+--+---+ |
|   | | |  | |   |    |    | |  |  |  |   many keyspaces |   |  |  |   | |
|   | | |  | |   |    |    | |  |  |  | +------------->  |   |  |  |   | |
|   | | |  | |   |    |    | |  |  |  |                  |   |  |  |   | |
|   | | |  | |   |    |    | |  |  |  |                  |   |  |  |   | |
|   +--------+   |    |    +-------+  |                  |   +---------+ |
+----------------+    +---------------+                  +---------------+

notes:

  1. Many keyspaces (keyspace per user)
  2. Avoids storing a "user_id" value on each row (the keyspace name can serve as the user id)
  3. Very few tables per keyspace (in this example only 1 table per keyspace)

Option 3:

  1. Create a global keyspace
  2. Create a table per user_id (with the mandatory columns as well as that user's additional columns)

+---------------------------------------------------------------+
|                            Keyspace                           |
|                                                               |
|       user_1        user_2                         user_n     |
|    +--+---+--+   +--+--+--+                      +--+--+--+   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    +--+---+--+   +--+--+--+                      +--+--+--+   |
|                                                               |
|                                                               |
+---------------------------------------------------------------+

Notes

  1. Global keyspace
  2. A table per user_id ("many" tables)
  3. Avoids duplicating the user id per row

Option 4: (Does this make sense?)

Create multiple keyspaces (for instance, "x" keyspaces), each holding a range of tables (one table per user)

                      keyspace_1                                                                                keyspace_x
+---------------------------------------------------------------+                         +---------------------------------------------------------------+
|                                                               |                         |                                                               |
|                                                               |                         |                                                               |
|       user_1        user_2                        user_n/x    |                         |     user_n-x      user_n-x+1                       user_n     |
|    +--+---+--+   +--+--+--+                      +--+--+--+   |                         |    +--+------+   +--+--+--+                      +--+--+--+   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |        "X" keyspaces    |    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   | +---------------------> |    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |                         |    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |                         |    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    |  |   |  |   |  |  |  |                      |  |  |  |   |                         |    |  |   |  |   |  |  |  |                      |  |  |  |   |
|    +--+---+--+   +--+--+--+                      +--+--+--+   |                         |    +--+---+--+   +--+--+--+                      +--+--+--+   |
|                                                               |                         |                                                               |
|                                                               |                         |                                                               |
+---------------------------------------------------------------+                         +---------------------------------------------------------------+

Notes:

  1. Multiple keyspaces
  2. Multiple tables per user
  3. Requires a "lookup" to figure out which keyspace contains the required table

Option 5:

Split data to multiple tables and multiple keyspaces

Notes:

  1. Requires "joining" information from multiple tables in some cases
  2. Seems to be more complicated


General notes for all scenarios:

  1. There are an order of magnitude fewer writes than reads
  2. Many millions of reads per day
  3. Traffic fluctuates per user_id - some user_ids see a lot of traffic while others see much less. Tuning would be needed per this metric
  4. Some user_ids are updated (written to) more frequently than others
  5. We have multiple data centers across geographies that should stay in sync
  6. There is a long tail per primary key (some keys are accessed many times while other keys are rarely accessed)

Asked by Avner Barr, Sep 25 '17


1 Answer

This type of integration challenge is usually solved with an EAV (Entity Attribute Value) data model in relational systems (like the one Ashrafaul demonstrates). The key consideration in an EAV model is the unbounded number of columns. An EAV data model can, of course, be mimicked in a CQL system like Cassandra or ScyllaDB. The EAV model lends itself nicely to writes but presents challenges when reading. You haven't really detailed your read considerations. Do you need all columns back, or do you need specific columns back per user?

Files

Having said that, there are some further considerations inherent to Cassandra and ScyllaDB that may point you towards a unified EAV model over some of the designs you describe in your question. Both Cassandra and ScyllaDB lay out keyspaces and tables as files on disk. The number of files is basically the product of the number of keyspaces and the number of tables, so the more keyspaces and tables you have, the more files you'll have on disk. This may become an issue with file descriptors and other OS file-handling limits. Due to the long tail of access you mentioned, it may be the case that every file is open all the time. That is not desirable, especially when starting from a cold boot.

[edit for clarity] All things being equal, one keyspace/table will always produce fewer files than many keyspaces/tables. This has nothing to do with the amount of data stored or the compaction strategy.

Wide Rows

But getting back to the data model. Ashraful's model has a partition key (userid) and a clustering key (key->column1). Given the number of "entries" in each user file (500K-2M), and assuming each entry is a row of ~60 columns on average, you would basically be creating 500K-2M rows of ~60 columns each per partition key, i.e. very large partitions. Cassandra and Scylla generally don't like very large partitions. Can they handle large partitions? Sure. Do large partitions impact performance in practice? Yes.

Updates or versioning

You mention updates. The base EAV model will only represent the most recent update. There is no versioning. What you could do is add time as a clustering key to ensure that you maintain the historical values of your columns over time.
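
A hedged sketch of that idea in CQL, reusing the userid/column1/column/value shape from the table below (the updated_at name is illustrative); clustering newest-first makes "latest value per column" the first row of each clustering group:

```sql
CREATE TABLE data_versioned (
    userid bigint,
    column1 text,        -- unique entry key within a user
    column text,         -- attribute name (EAV)
    updated_at timestamp,
    value text,
    PRIMARY KEY ( (userid, column1), column, updated_at )
) WITH CLUSTERING ORDER BY (column ASC, updated_at DESC);
```

Note that keeping every version grows the partition over time, so this trades partition size for history.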

Reads

If you want all the columns back, you could just serialize everything into a JSON object and store it in a single column, but I imagine that's not what you want. In the primary-key (partition-key) model of a key/value system like Cassandra or Scylla, you need to know all the components of the key to get your data back. If you put column1, the unique row identifier, into your primary key, you will need to know it in advance; likewise any other column names that go into the primary key.

Partitions and Composite Partition Keys

The number of partitions dictates the parallelism of your cluster. The total number of partitions, i.e. the cardinality of partitions in your total corpus, affects the utilization of your cluster hardware. More partitions = better parallelism and higher resource utilization.

What I might do here is modify the PRIMARY KEY to include column1 in the partition key. Then I would use column as a clustering key (which dictates not only uniqueness within a partition but also sort order, so consider this in your column naming conventions).

In the following table definition you would need to provide the userid and column1 as equalities in your WHERE clause.

CREATE TABLE data (
    userid bigint,
    column1 text,
    column text,
    value text,
    PRIMARY KEY ( (userid, column1), column )
);
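
For example, a read against this table must pin both partition-key components with equality predicates (the literal values here are made up for illustration):

```sql
SELECT column, value
FROM data
WHERE userid = 42
  AND column1 = 'user_id_42_key_1';
```

This returns every attribute name/value pair for that one entry, sorted by column.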

I'd also have a separate table, maybe columns_per_user, that records all columns for each userid. Something like

CREATE TABLE columns_per_user (
    userid bigint,
    max_columns int,
    column_names text,
    PRIMARY KEY ( userid )
);

Here max_columns is the total number of columns for this user and column_names holds the actual column names. You might also add a column for the total number of entries per user, something like user_entries int, which would basically be the number of rows in each user's csv file.
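
Putting the two tables together, a client read for "all columns of one entry" becomes a two-step lookup; sketched in CQL (identifiers as defined above, literal values illustrative):

```sql
-- Step 1: discover which columns this user supplies
SELECT column_names
FROM columns_per_user
WHERE userid = 42;

-- Step 2: fetch the attribute/value pairs for one entry of that user
SELECT column, value
FROM data
WHERE userid = 42
  AND column1 = 'user_id_42_key_1';
```

Step 1 is only needed when the client wants specific columns or must interpret the per-user schema; a "give me everything" read can go straight to step 2.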

Answered by siculars, Oct 19 '22