Cassandra preventing duplicates

I have a simple table distributed by userId:

create table test (
  userId uuid,
  placeId uuid,
  visitTime timestamp,
  primary key(userId, placeId, visitTime)
) with clustering order by (placeId asc, visitTime desc);

Each pair (userId, placeId) can have either one visit or none. visitTime is just data associated with the visit, used for sorting in queries like select * from test where userId = ? order by visitTime desc.

How can I require (userId, placeId) to be unique? I need to make sure that

insert into test (userId, placeId, visitTime) values (?, ?, ?)

won't insert a second visit for (userId, placeId) with a different time. Checking for existence before inserting isn't atomic; is there a better way?

Asked Mar 04 '15 by Sebastian Nowak

2 Answers

Let me understand: if the pair (userId, placeId) must be unique (meaning you will never have two rows with the same pair), what is visitTime useful for in the primary key? And why would you query with order by visitTime desc if there is only ever one row per pair?

If what you need is to prevent duplication, you have two options.

1 - Lightweight transactions -- using IF NOT EXISTS, this does what you want. But, as I explained here, lightweight transactions are really slow, because Cassandra runs a Paxos round (several extra round trips) for each one.
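A minimal sketch, assuming the same users table used in the example below (visittime as a regular column, primary key (uid, placeid)):

INSERT INTO users (uid, placeid , visittime , otherstuffs ) VALUES ( 1, 2, 1000, 'FIRST VISIT') IF NOT EXISTS;

The insert is applied only if no row with (uid = 1, placeid = 2) exists yet; a second INSERT ... IF NOT EXISTS for the same pair comes back with [applied] = false and leaves the existing row untouched.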

2 - USING TIMESTAMP write-time enforcement (be careful with it!***) -- the 'trick' is to force a decreasing TIMESTAMP on every write.

Let me give an example:

INSERT INTO users (uid, placeid , visittime , otherstuffs ) VALUES ( 1, 2, 1000, 'PLEASE DO NOT OVERWRITE ME') using TIMESTAMP 100;

This produces the following output:

select * from users;

 uid | placeid | otherstuffs                | visittime
-----+---------+----------------------------+-----------
   1 |       2 | PLEASE DO NOT OVERWRITE ME |      1000

Let's now write with a lower timestamp:

INSERT INTO users (uid, placeid , visittime , otherstuffs ) VALUES ( 1, 2, 2000, 'I WANT TO OVERWRITE YOU') using TIMESTAMP 90;

The data in the table have not been updated, since a write with a higher timestamp (100) already exists for the pair (uid, placeid) -- in fact the output has not changed:

select * from users;

 uid | placeid | otherstuffs                | visittime
-----+---------+----------------------------+-----------
   1 |       2 | PLEASE DO NOT OVERWRITE ME |      1000

If performance matters, use solution 2; if it doesn't, use solution 1. For solution 2 you can calculate a decreasing timestamp for each write as a fixed number minus the current system time in milliseconds,

e.g.:

Long decreasingTimestamp = 2_000_000_000_000L - System.currentTimeMillis();

*** this solution might lead to unexpected behaviour if, for instance, you want to delete and then reinsert data. Keep in mind that once you delete a row, you will be able to write it again only if the write operation carries a higher timestamp than the deletion's (if not specified, the timestamp used is the current machine time, in microseconds).
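A quick sketch of that pitfall, continuing with the users table above (the DELETE below, with no timestamp specified, is stamped with the current machine time in microseconds, which is far higher than 110):

DELETE FROM users WHERE uid = 1 AND placeid = 2;
INSERT INTO users (uid, placeid , visittime , otherstuffs ) VALUES ( 1, 2, 3000, 'REINSERTED') using TIMESTAMP 110;

The reinsert is silently shadowed by the deletion's tombstone, and select * from users; still returns no row for (1, 2). To bring the row back you must write it with a TIMESTAMP higher than the deletion's.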

HTH,
Carlo

Answered by Carlo Bertuccini


With Cassandra, each primary key (partition key + clustering key) combination is unique. So if you have an entry with primary key (A, B, C) and you insert a new one with the same (A, B, C) values, the old one is overwritten.

In your case, visitTime is part of the primary key, so two inserts for the same (userId, placeId) with different visit times create two distinct rows rather than overwriting. You might want to rethink your schema and leave visitTime out of the primary key.
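A minimal sketch of such a rework (assumed schema, not from the original answer): with visitTime as a regular column, a second insert for the same (userId, placeId) simply overwrites the first row instead of adding a new one:

create table test (
  userId uuid,
  placeId uuid,
  visitTime timestamp,
  primary key(userId, placeId)
);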

Answered by Aleksandar Stojadinovic