How do you model user data in Cassandra?
We have various data and metadata associated with a customer, which we currently hold in separate tables that share the same partitioning & clustering keys.
This leads to fetching bits of information for a user from different tables (e.g. for analytics), effectively "joining" two or more Cassandra tables on their partition keys.
On the positive side, inserts into each table can be done independently.
Is there a race condition when concurrently updating data under the same partition key but in different columns? Or are the deltas gracefully merged in the SSTables?
Is having multiple tables with the same partition (and clustering) keys usual or an anti-pattern?
To make this more concrete, let's say we have:
CREATE TABLE example (
    pk text PRIMARY KEY,
    col_a text,
    col_b text
);
Assume that for a given partition key (pk), both col_a and col_b initially have some value (i.e. not null), and two concurrent inserts update each of them. Is there any race condition possible there? Could one of the two updates be lost, despite them writing into separate columns?
Cassandra is a NoSQL database, essentially a distributed key-value store. In Cassandra's data model, data is stored as a set of rows organized into tables; tables are also called column families.
MongoDB, Cassandra, and other horizontally scaling databases are examples of this class of systems. Because of its scalability, NoSQL can manage large amounts of data, and as the data grows it scales out to accommodate it efficiently.
Where Cassandra is a poor fit is many-to-many mappings and join tables: it doesn't support a relational schema with foreign keys and joins. So if you want to write a lot of complex join queries, Cassandra might not be the right database for you.
Write conflicts are something you shouldn't need to worry about. All INSERTs/UPDATEs/DELETEs in Cassandra are upserts, and writes are applied and resolved per column (cell), not per row.
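To illustrate the upsert behaviour, here is a minimal sketch (the pk value '2' and the output below are illustrative): an UPDATE against a row that doesn't exist simply creates it, just as an INSERT against an existing row overwrites it.
cqlsh:test_keyspace> update race_condition_test set col_a = 'new' where pk = '2';
cqlsh:test_keyspace> select * from race_condition_test where pk = '2';
pk | col_a | col_b
----+-------+-------
 2 |   new |  null
(1 rows)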
Cassandra uses a last-write-wins strategy to manage conflicts. As you can see in the example below, whenever you change a value, the timestamp associated with that column is updated. Since your concurrent updates touch different columns (one thread updates col_a, the other col_b), each write only replaces the cell it names, and neither update is lost.
Initial Insert
cqlsh:test_keyspace> insert into race_condition_test (pk, col_a, col_b ) VALUES ( '1', 'deckard', 'Blade Runner');
cqlsh:test_keyspace> select * from race_condition_test ;
pk | col_a | col_b
----+---------+--------------
1 | deckard | Blade Runner
(1 rows)
The timestamps are the same after the initial insert:
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
pk | col_a | writetime(col_a) | col_b | writetime(col_b)
----+---------+------------------+--------------+------------------
1 | deckard | 1526916970412357 | Blade Runner | 1526916970412357
(1 rows)
Once col_b is updated, its timestamp changes to reflect the change.
cqlsh:test_keyspace> insert into race_condition_test (pk, col_b ) VALUES ( '1', 'Rick');
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
pk | col_a | writetime(col_a) | col_b | writetime(col_b)
----+---------+------------------+-------+------------------
1 | deckard | 1526916970412357 | Rick | 1526917272641682
(1 rows)
After col_a is updated, it too gets its timestamp updated to the new value.
cqlsh:test_keyspace> insert into race_condition_test (pk, col_a) VALUES ( '1', 'bounty hunter');
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;
pk | col_a | writetime(col_a) | col_b | writetime(col_b)
----+---------------+------------------+-------+------------------
1 | bounty hunter | 1526917323082217 | Rick | 1526917272641682
(1 rows)
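To see last-write-wins directly, you can supply the timestamp yourself with USING TIMESTAMP (a sketch; the literal timestamp is arbitrary). A write carrying an older timestamp than the currently stored cell loses, and the stored value is untouched:
cqlsh:test_keyspace> insert into race_condition_test (pk, col_b) VALUES ( '1', 'stale') USING TIMESTAMP 1526917272641600;
cqlsh:test_keyspace> select pk, col_b, writetime(col_b) from race_condition_test ;
pk | col_b | writetime(col_b)
----+-------+------------------
 1 |  Rick | 1526917272641682
(1 rows)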
My recommendation is to use one single table that serves your query needs. If you query by pk, create a single table with all the columns you need. That way you have a single wide row that can be read back efficiently as part of a single query.
The data model you describe in option 2 is a bit too relational and is not optimal for Cassandra. You cannot perform joins natively in Cassandra, and you should avoid performing joins on the client side.
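Concretely, instead of splitting customer attributes across tables that share the same key, merge them into one (a sketch; the table names here are hypothetical):
-- instead of two tables sharing the same partition key...
CREATE TABLE customer_profile (pk text PRIMARY KEY, col_a text);
CREATE TABLE customer_metadata (pk text PRIMARY KEY, col_b text);
-- ...keep one table and read everything back with a single query:
CREATE TABLE customer (pk text PRIMARY KEY, col_a text, col_b text);
SELECT col_a, col_b FROM customer WHERE pk = '1';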
Data Model Rules:
Rule 1: Spread data evenly across the cluster. Create a partition key that distributes the data evenly across the cluster, so you don't end up with hotspots.
Rule 2: Minimize the number of partitions read. Each partition may reside on a different node, so ideally a query should hit only one partition (and hence one node) for performance's sake.
Rule 3: Model around your queries. Design one table per query pattern, so each read can be served by a single-partition lookup (see the sketch below).
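For example, if the common access pattern were "fetch all of a customer's metrics for a given day", a table keyed for exactly that query might look like this (a sketch; the table and column names are hypothetical):
CREATE TABLE analytics_by_customer_day (
    customer_id text,
    day date,
    metric text,
    value double,
    PRIMARY KEY ((customer_id, day), metric)
);
SELECT metric, value FROM analytics_by_customer_day WHERE customer_id = 'c1' AND day = '2018-05-21';
The composite partition key (customer_id, day) means this query reads exactly one partition, which also satisfies Rules 1 and 2.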
If you want to read a little bit more about this, I suggest this DataStax page: Basic Rules of Cassandra Data Modeling