I'm new to Cassandra and am looking for a best practice on how to model data with the following general structure:
The data is "user" based (per customer); each user supplies a big data file of around 500K-2M entries, updated a few times a day (sometimes a full update, sometimes only deltas).
Each data file has certain mandatory data fields (~20 mandatory), but a user can add additional columns at their discretion (up to ~100).
The additional data fields are NOT necessarily the same across users (neither the field names nor their types).
Example (csv format):
user_id_1.csv
| column1 (unique key per user_id) | column2 | column3 | ... | column10 | additionalColumn1 | ...additionalColumn_n |
|-----------------------------------|-----------|----------|---------|------------|---------------------|------------------------|
| user_id_1_key_1 | value | value | value | value | ... | value |
| user_id_1_key_2 | .... | .... | .... | .... | ... | ... |
| .... | ... | ... | ... | ... | ... | ... |
| user_id_1_key_2Million | .... | .... | .... | .... | ... | ... |
user_id_XXX.csv (notice that the first 10 columns are identical to those of the other users, but the additional columns are different - both their names and their types)
| column1 (unique key per user_id) | column2 | column3 | ... | column10 | additionalColumn1 (different types than user_id_1 and others) | ...additional_column_x |
|-----------------------------------------------------------|-----------|----------|---------|------------|-----------------------------------------------------------------|-------------------------|
| user_id_XXX_key_1 | value | value | value | value | ... | value |
| user_id_XXX_key_2 | .... | .... | .... | .... | ... | ... |
| .... | ... | ... | ... | ... | ... | ... |
| user_id_XXX_key_500_thousand (fewer rows than the other users) | .... | .... | .... | .... | ... | ... |
Several options I have considered:
Option 1:
Add a user_id column alongside all of the other columns (including the non-mandatory ones) in one big table. The primary key becomes (user_id, column_1), since column_1 is unique per user_id. A CQL sketch follows the diagram below.
Keyspace
+--------------------------------------------------------------------------+
| |
| |
| Data_Table |
| + +--------+-------+--------------------------+-----+ |
| | | | | | | |
| | +-------------------------------------------------+ |
| | | | | | | |
| many rows | +-------------------------------------------------+ |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | Many columns | | |
| | | | +------------------------> | | |
| | | | | | | |
| | +-------------------------------------------------+ |
| v +-------------------------------------------------+ |
| |
+--------------------------------------------------------------------------+
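A minimal CQL sketch of this option, assuming the additional columns of all users can be declared up front (or added later with ALTER TABLE) on the one shared table; names and types here are illustrative only:
CREATE TABLE data_table (
    user_id text,
    column_1 text,             -- unique per user_id, used as the clustering key
    column_2 text,
    -- ... column_3 .. column_10 (the other mandatory columns) ...
    additional_column_1 text,  -- must have a single type shared by every user
    PRIMARY KEY ( user_id, column_1 )
);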
A few things that I notice right away:
Option 2:
Create Keyspace per User_id
Create table "data" per keyspace
+-----------------------------------------------------------------------------------+
| column_1 | column_2 | ... | column_n | additional_column_1 | additional_column_n |
+-----------------------------------------------------------------------------------+
keyspace_user1 keyspace_user2 keyspace_user_n
+----------------+ +---------------+ +---------------+
| | | | | |
| | | | | |
| +-+-+--+-+ | | +-+--+--+ | | +--+--+---+ |
| | | | | | | | | | | | | many keyspaces | | | | | |
| | | | | | | | | | | | | +-------------> | | | | | |
| | | | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | |
| +--------+ | | +-------+ | | +---------+ |
+----------------+ +---------------+ +---------------+
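A rough CQL sketch of this option; the replication settings and the datacenter name 'dc1' are placeholders, not recommendations:
CREATE KEYSPACE keyspace_user1
    WITH replication = { 'class': 'NetworkTopologyStrategy', 'dc1': 3 };

CREATE TABLE keyspace_user1.data (
    column_1 text,             -- unique per user, so it can serve as the partition key
    column_2 text,
    -- ... the other mandatory columns ...
    additional_column_1 int,   -- free to differ in name and type from other users' keyspaces
    PRIMARY KEY ( column_1 )
);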
notes:
Option 3:
1) Create a global keyspace. 2) Create a table per user_id (with the mandatory columns as well as that user's additional columns). A sketch follows the diagram below.
+---------------------------------------------------------------+
| Keyspace |
| |
| user_1 user_2 user_n |
| +--+---+--+ +--+--+--+ +--+--+--+ |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| | | | | | | | | | | | | |
| +--+---+--+ +--+--+--+ +--+--+--+ |
| |
| |
+---------------------------------------------------------------+
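A rough CQL sketch of this option; the keyspace name global_ks and the column types are made up for illustration:
CREATE TABLE global_ks.user_1 (
    column_1 text,                   -- unique key per user
    column_2 text,
    -- ... the other mandatory columns ...
    additional_column_1 timestamp,   -- this user's own extra columns and types
    PRIMARY KEY ( column_1 )
);

-- each further user gets its own table with its own additional columns, e.g.:
CREATE TABLE global_ks.user_2 (
    column_1 text,
    column_2 text,
    additional_column_a int,
    PRIMARY KEY ( column_1 )
);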
Notes
Option 4: (Does this make sense?)
Create multiple keyspaces (for instance "x" keyspaces), each holding a range of tables (one table per user)
keyspace_1 keyspace_x
+---------------------------------------------------------------+ +---------------------------------------------------------------+
| | | |
| | | |
| user_1 user_2 user_n/x | | user_n-x user_n-x+1 user_n |
| +--+---+--+ +--+--+--+ +--+--+--+ | | +--+------+ +--+--+--+ +--+--+--+ |
| | | | | | | | | | | | | | "X" keyspaces | | | | | | | | | | | | | |
| | | | | | | | | | | | | | +---------------------> | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
| +--+---+--+ +--+--+--+ +--+--+--+ | | +--+---+--+ +--+--+--+ +--+--+--+ |
| | | |
| | | |
+---------------------------------------------------------------+ +---------------------------------------------------------------+
Notes:
Option 5:
Split the data across multiple tables and multiple keyspaces
Notes: 1. Requires "joining" information from multiple tables in some cases. 2. Seems more complicated.
General notes for all scenarios:
Cassandra is a NoSQL database, often described as a key-value / wide-column store. A few relevant features of the Cassandra data model: data is stored as a set of rows that are organized into tables, and tables are also called column families.
The goals of a successful Cassandra data model are to choose a partition key that (1) distributes data evenly across the nodes in the cluster, (2) minimizes the number of partitions read by a single query, and (3) bounds the size of a partition.
This type of integration challenge is usually solved by an EAV (Entity-Attribute-Value) data model in relational systems (like the one Ashrafaul demonstrates). The key consideration with an EAV model is an unbounded number of columns. An EAV data model can, of course, be mimicked in a CQL system like Cassandra or ScyllaDB. The EAV model lends itself nicely to writes but presents challenges when reading. You haven't really detailed your read considerations: do you need all columns back, or do you need specific columns back per user?
Files
Having said that, there are some further considerations inherent to Cassandra and ScyllaDB that may point you towards a unified EAV model over some of the designs you describe in your question. Both Cassandra and ScyllaDB lay out keyspaces and tables as files on disk. The number of files is basically the product of the number of keyspaces and the number of tables. So the more keyspaces, tables, or combinations of the two you have, the more files you'll have on disk. This may become an issue with file descriptors and other OS file-handling limits. Due to the long tail of access you mentioned, it may be the case that every file is open all the time. That is not so desirable, especially when starting from a cold boot.
[edit for clarity] All things being equal, one keyspace/table will always produce fewer files than many keyspaces/tables. This has nothing to do with the amount of data stored or the compaction strategy.
Wide Rows
But getting back to the data model. Ashraful's model has a partition key (userid) and a clustering key (key -> column1). Given the number of "entries" in each user file (500K-2M), and assuming each entry is a row comprising ~60 columns on average, you're basically creating 500K-2M * 60 ≈ 30M-120M values per partition key, i.e. very large partitions. Cassandra and Scylla generally don't like very large partitions. Can they handle large partitions? Sure. Do large partitions impact performance in practice? Yes.
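For reference, that base model looks roughly like this (the exact definition isn't reproduced here, so treat this as an approximation): one partition per userid, with the entry key and the column name as clustering keys.
CREATE TABLE data_eav (
    userid bigint,
    column1 text,    -- the per-user entry key from the csv
    column text,     -- attribute (column) name
    value text,      -- attribute value
    PRIMARY KEY ( userid, column1, column )
);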
Updates or versioning
You mention updates. The base EAV model will only represent the most recent update. There is no versioning. What you could do is add time as a clustering key to ensure that you maintain the historical values of your columns over time.
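A sketch of that versioned variant, assuming an updated_at timestamp column (the name is made up) ordered so that the newest value sorts first:
CREATE TABLE data_versioned (
    userid bigint,
    column1 text,
    column text,
    updated_at timestamp,   -- time as a clustering key keeps the history of each value
    value text,
    PRIMARY KEY ( userid, column1, column, updated_at )
) WITH CLUSTERING ORDER BY ( column1 ASC, column ASC, updated_at DESC );

-- most recent value of one column for one entry:
SELECT value FROM data_versioned
WHERE userid = 42 AND column1 = 'user_id_1_key_1' AND column = 'column2'
LIMIT 1;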
Reads
If you want all the columns back, you could just serialize everything into a JSON object and put it in a single column, but I imagine that's not what you want. In the primary-key (partition key) model of a key/value based system like Cassandra or Scylla, you need to know all the components of the key to get your data back. If you put column1, the unique row identifier, into your primary key, you will need to know it in advance; likewise the other column names if those go into the primary key as well.
Partitions and Composite Partition Keys
The number of partitions dictates the parallelism of your cluster. The number of total partitions, or the cardinality of partitions in your total corpus, has an effect on the utilization of your cluster hardware. More partitions = better parallelism and higher resource utilization.
What I might do here is modify the PRIMARY KEY to include column1. Then I would use column as a clustering key (which not only dictates uniqueness within a partition but also sort order, so consider this in your column naming conventions). In the following table definition you would need to provide the userid and column1 as equalities in your WHERE clause.
CREATE TABLE data (
    userid bigint,   -- which user the entry belongs to
    column1 text,    -- the unique per-user row key from the csv
    column text,     -- attribute (column) name
    value text,      -- attribute value
    PRIMARY KEY ( (userid, column1), column )
);
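By way of illustration, reads and writes against that table might look like this (the values are made up):
-- one row per (user, entry, column name):
INSERT INTO data (userid, column1, column, value)
VALUES (42, 'user_id_1_key_1', 'column2', 'some value');

-- a specific column of a specific entry needs the full partition key plus the clustering column:
SELECT value FROM data
WHERE userid = 42 AND column1 = 'user_id_1_key_1' AND column = 'column2';

-- all columns of one entry only need the partition key:
SELECT column, value FROM data
WHERE userid = 42 AND column1 = 'user_id_1_key_1';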
I'd also have a separate table, maybe columns_per_user, that records all columns for each userid. Something like
CREATE TABLE columns_per_user (
    userid bigint,
    max_columns int,     -- total number of columns for this user
    column_names text,   -- the actual column names
    PRIMARY KEY ( userid )
);
Where max_columns is the total number of columns for this user and column_names are the actual column names. You might also have a column for the total number of entries per user, something like user_entries int, which would basically be the number of rows in each user's csv file.
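For completeness, populating and reading that helper table might look like this (illustrative values only):
INSERT INTO columns_per_user (userid, max_columns, column_names)
VALUES (42, 27, 'column1,column2,column3,additionalColumn1');

SELECT max_columns, column_names FROM columns_per_user WHERE userid = 42;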