I'm new to Cassandra and am looking for a best practice on how to model data with the following general structure:
The data is "user" based (per customer); each user supplies a big data file of around 500K-2M entries, updated a few times a day (sometimes a full update, sometimes only deltas).
Each data file has certain mandatory data fields (~20 mandatory), but a user can add additional columns at their discretion (up to ~100).
The additional data fields are NOT necessarily the same across users (neither the field names nor their types).
Example (csv format):
user_id_1.csv
| column1 (unique key per user_id) | column2 | column3 | ... | column10 | additionalColumn1 | ...additionalColumn_n |
|-----------------------------------|-----------|----------|---------|------------|---------------------|------------------------|
| user_id_1_key_1 | value | value | value | value | ... | value |
| user_id_1_key_2 | .... | .... | .... | .... | ... | ... |
| .... | ... | ... | ... | ... | ... | ... |
| user_id_1_key_2Million | .... | .... | .... | .... | ... | ... |
user_id_XXX.csv (notice that the first 10 columns are identical to those of the other users, but the additional columns are different - both their names and their types)
| column1 (unique key per user_id) | column2 | column3 | ... | column10 | additionalColumn1 (different types than user_id_1 and others) | ...additional_column_x |
|-----------------------------------------------------------|-----------|----------|---------|------------|-----------------------------------------------------------------|-------------------------|
| user_id_XXX_key_1 | value | value | value | value | ... | value |
| user_id_XXX_key_2 | .... | .... | .... | .... | ... | ... |
| .... | ... | ... | ... | ... | ... | ... |
| user_id_XXX_key_500_thousand (fewer rows than the other users) | .... | .... | .... | .... | ... | ... |
Several options I have considered:
Option 1:
Add a user_id column alongside all of the other columns (including the non-mandatory ones) in one big table. The primary key becomes (user_id, column_1), since column_1 is unique per user_id. A CQL sketch follows the diagram below.
Keyspace
+--------------------------------------------------------------------------+
| |
| |
| Data_Table |
| + +--------+-------+--------------------------+-----+ |
| | | | | | | |
| | +-------------------------------------------------+ |
| | | | | | | |
| many rows | +-------------------------------------------------+ |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | Many columns | | |
| | | | +------------------------> | | |
| | | | | | | |
| | +-------------------------------------------------+ |
| v +-------------------------------------------------+ |
| |
+--------------------------------------------------------------------------+
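A minimal CQL sketch of this option, assuming the additional columns of all users can be declared up front (or added later with ALTER TABLE) on the one shared table; names and types here are illustrative only:
CREATE TABLE data_table (
    user_id text,
    column_1 text,             -- unique per user_id, used as the clustering key
    column_2 text,
    -- ... column_3 .. column_10 (the other mandatory columns) ...
    additional_column_1 text,  -- must have a single type shared by every user
    PRIMARY KEY ( user_id, column_1 )
);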
A few things that I notice right away:
Option 2:
Create Keyspace per User_id
Create table "data" per keyspace
+-----------------------------------------------------------------------------------+
| column_1 | column_2 | ... | column_n | additional_column_1 | additional_column_n |
+-----------------------------------------------------------------------------------+
keyspace_user1 keyspace_user2 keyspace_user_n
+----------------+ +---------------+ +---------------+
| | | | | |
| | | | | |
| +-+-+--+-+ | | +-+--+--+ | | +--+--+---+ |
| | | | | | | | | | | | | many keyspaces | | | | | |
| | | | | | | | | | | | | +-------------> | | | | | |
| | | | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | |
| +--------+ | | +-------+ | | +---------+ |
+----------------+ +---------------+ +---------------+
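A rough CQL sketch of this option; the replication settings and the datacenter name 'dc1' are placeholders, not recommendations:
CREATE KEYSPACE keyspace_user1
    WITH replication = { 'class': 'NetworkTopologyStrategy', 'dc1': 3 };

CREATE TABLE keyspace_user1.data (
    column_1 text,             -- unique per user, so it can serve as the partition key
    column_2 text,
    -- ... the other mandatory columns ...
    additional_column_1 int,   -- free to differ in name and type from other users' keyspaces
    PRIMARY KEY ( column_1 )
);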
notes:
Option 3:
1) Create a global keyspace. 2) Create a table per user_id (with the mandatory columns as well as that user's additional columns). A sketch follows the diagram below.
+---------------------------------------------------------------+
| Keyspace |
| |
| user_1 user_2 user_n |
| +--+---+--+ +--+--+--+ +--+--+--+ |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| +--+---+--+ +--+--+--+ +--+--+--+ |
| |
| |
+---------------------------------------------------------------+
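A rough CQL sketch of this option; the keyspace name global_ks and the column types are made up for illustration:
CREATE TABLE global_ks.user_1 (
    column_1 text,                   -- unique key per user
    column_2 text,
    -- ... the other mandatory columns ...
    additional_column_1 timestamp,   -- this user's own extra columns and types
    PRIMARY KEY ( column_1 )
);

-- each further user gets its own table with its own additional columns, e.g.:
CREATE TABLE global_ks.user_2 (
    column_1 text,
    column_2 text,
    additional_column_a int,
    PRIMARY KEY ( column_1 )
);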
Notes
Option 4: (Does this make sense?)
Create multiple keyspaces (for instance "x" keyspaces), each holding a range of tables (one table per user)
keyspace_1 keyspace_x
+---------------------------------------------------------------+ +---------------------------------------------------------------+
| | | |
| | | |
| user_1 user_2 user_n/x | | user_n-x user_n-x+1 user_n |
| +--+---+--+ +--+--+--+ +--+--+--+ | | +--+------+ +--+--+--+ +--+--+--+ |
| | | | | | | | | | | | | | "X" keyspaces | | | | | | | | | | | | | |
| | | | | | | | | | | | | | +---------------------> | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
| +--+---+--+ +--+--+--+ +--+--+--+ | | +--+---+--+ +--+--+--+ +--+--+--+ |
| | | |
| | | |
+---------------------------------------------------------------+ +---------------------------------------------------------------+
Notes:
Option 5:
Split the data across multiple tables and multiple keyspaces
Notes: 1. Requires "joining" information from multiple tables in some cases. 2. Seems more complicated.
General notes for all scenarios:
Cassandra is a NoSQL database, often described as a key-value / wide-column store. A few relevant features of the Cassandra data model: data is stored as a set of rows that are organized into tables, and tables are also called column families.
The goals of a successful Cassandra data model are to choose a partition key that (1) distributes data evenly across the nodes in the cluster, (2) minimizes the number of partitions read by a single query, and (3) bounds the size of a partition.
This type of integration challenge is usually solved by an EAV (Entity-Attribute-Value) data model in relational systems (like the one Ashrafaul demonstrates). The key consideration with an EAV model is an unbounded number of columns. An EAV data model can, of course, be mimicked in a CQL system like Cassandra or ScyllaDB. The EAV model lends itself nicely to writes but presents challenges when reading. You haven't really detailed your read considerations: do you need all columns back, or do you need specific columns back per user?
Files
Having said that, there are some further considerations inherent to Cassandra and ScyllaDB that may point you towards a unified EAV model over some of the designs you describe in your question. Both Cassandra and ScyllaDB lay out keyspaces and tables as files on disk. The number of files is basically the product of the number of keyspaces and the number of tables. So the more keyspaces, tables, or combinations of the two you have, the more files you'll have on disk. This may become an issue with file descriptors and other OS file-handling limits. Due to the long tail of access you mentioned, it may be the case that every file is open all the time. That is not so desirable, especially when starting from a cold boot.
[edit for clarity] All things being equal, one keyspace/table will always produce fewer files than many keyspaces/tables. This has nothing to do with the amount of data stored or the compaction strategy.
Wide Rows
But getting back to the data model. Ashraful's model has a partition key (userid) and a clustering key (key -> column1). Given the number of "entries" in each user file (500K-2M), and assuming each entry is a row comprising ~60 columns on average, you're basically creating 500K-2M * 60 ≈ 30M-120M values per partition key, i.e. very large partitions. Cassandra and Scylla generally don't like very large partitions. Can they handle large partitions? Sure. Do large partitions impact performance in practice? Yes.
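For reference, that base model looks roughly like this (the exact definition isn't reproduced here, so treat this as an approximation): one partition per userid, with the entry key and the column name as clustering keys.
CREATE TABLE data_eav (
    userid bigint,
    column1 text,    -- the per-user entry key from the csv
    column text,     -- attribute (column) name
    value text,      -- attribute value
    PRIMARY KEY ( userid, column1, column )
);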
Updates or versioning
You mention updates. The base EAV model will only represent the most recent update. There is no versioning. What you could do is add time as a clustering key to ensure that you maintain the historical values of your columns over time.
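A sketch of that versioned variant, assuming an updated_at timestamp column (the name is made up) ordered so that the newest value sorts first:
CREATE TABLE data_versioned (
    userid bigint,
    column1 text,
    column text,
    updated_at timestamp,   -- time as a clustering key keeps the history of each value
    value text,
    PRIMARY KEY ( userid, column1, column, updated_at )
) WITH CLUSTERING ORDER BY ( column1 ASC, column ASC, updated_at DESC );

-- most recent value of one column for one entry:
SELECT value FROM data_versioned
WHERE userid = 42 AND column1 = 'user_id_1_key_1' AND column = 'column2'
LIMIT 1;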
Reads
If you want all the columns back, you could just serialize everything into a JSON object and put it in a single column, but I imagine that's not what you want. In the primary-key (partition key) model of a key/value based system like Cassandra or Scylla, you need to know all the components of the key to get your data back. If you put column1, the unique row identifier, into your primary key, you will need to know it in advance; likewise the other column names if those go into the primary key as well.
Partitions and Composite Partition Keys
The number of partitions dictates the parallelism of your cluster. The number of total partitions, or the cardinality of partitions in your total corpus, has an effect on the utilization of your cluster hardware. More partitions = better parallelism and higher resource utilization.
What I might do here is modify the PRIMARY KEY to include column1. Then I would use column as a clustering key (which not only dictates uniqueness within a partition but also sort order, so consider this in your column naming conventions). In the following table definition you would need to provide the userid and column1 as equalities in your WHERE clause.
CREATE TABLE data (
    userid bigint,   -- which user the entry belongs to
    column1 text,    -- the unique per-user row key from the csv
    column text,     -- attribute (column) name
    value text,      -- attribute value
    PRIMARY KEY ( (userid, column1), column )
);
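By way of illustration, reads and writes against that table might look like this (the values are made up):
-- one row per (user, entry, column name):
INSERT INTO data (userid, column1, column, value)
VALUES (42, 'user_id_1_key_1', 'column2', 'some value');

-- a specific column of a specific entry needs the full partition key plus the clustering column:
SELECT value FROM data
WHERE userid = 42 AND column1 = 'user_id_1_key_1' AND column = 'column2';

-- all columns of one entry only need the partition key:
SELECT column, value FROM data
WHERE userid = 42 AND column1 = 'user_id_1_key_1';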
I'd also have a separate table, maybe columns_per_user, that records all columns for each userid. Something like
CREATE TABLE columns_per_user (
    userid bigint,
    max_columns int,     -- total number of columns for this user
    column_names text,   -- the actual column names
    PRIMARY KEY ( userid )
);
Where max_columns is the total number of columns for this user and column_names are the actual column names. You might also have a column for the total number of entries per user, something like user_entries int, which would basically be the number of rows in each user's csv file.
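For completeness, populating and reading that helper table might look like this (illustrative values only):
INSERT INTO columns_per_user (userid, max_columns, column_names)
VALUES (42, 27, 'column1,column2,column3,additionalColumn1');

SELECT max_columns, column_names FROM columns_per_user WHERE userid = 42;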