Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance difference between synchronous SELECT + INSERT vs INSERT ... IF NOT EXISTS in CQL?

I have a CQL table (cql 3, cassandra 2.0.*) that looks something like:

CREATE TABLE IF NOT EXISTS user_things (
   user_id bigint,
   thing_id bigint,
   created_at timeuuid,
  PRIMARY KEY (user_id, thing_id)
);

I want to do an insert like

INSERT INTO user_things (user_id, thing_id, created_at) VALUES (?, ?, now())

but only if the row doesn't exist.

I could do this in two synchronous statements (first a SELECT, followed by an INSERT if the SELECT didn't return a row) or I could use INSERT ... IF NOT EXISTS.

The CQL docs state "But please note that using IF NOT EXISTS will incur a non negligible performance cost (internally, Paxos will be used) so this should be used sparingly."

I'm wondering if anybody has done benchmarking to see what is more performant if we have lots of these operations happening? (say hundreds a second)

like image 878
v_krishna Avatar asked Sep 15 '15 23:09

v_krishna


1 Answers

It depends a lot on what topology you are using. The IF NOT EXISTS is pretty fast if you restrict it to a local data center (with LOCAL_SERIAL) and use a small replication factor. If you try to use it across multiple data centers or with higher replication factors, then it slows down dramatically. There is an open ticket to improve its performance, so hopefully that will get completed soon since it is currently an overly expensive operation with lots of round trips.

Another thing that will slow IF NOT EXISTS down is when you use it on clustered rows. It seems to work the fastest when your table only has a compound partition key and no clustering columns.

If you go the read before write route, then you've got other problems to deal with. First off you will have a race condition since if two clients do a read around the same time, and then both decide to do a write, you'll get one overwriting the other, which kind of makes the read pointless (see another approach here: collision detection. If somehow you don't mind the race condition, and use a low consistency like ONE for the read and write, then it will likely outperform IF NOT EXISTS.

Pretty much you'd have to benchmark it for your system and schema to see which one was faster in your situation.

like image 91
Jim Meyer Avatar answered Jan 04 '23 00:01

Jim Meyer