Executing two identical requests but the DISTINCT keyword gives unexpected results. Without the keyword, the result is ok but with DISTINCT, it looks like the where clause is ignored. Why ?
Cqlsh version:
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.6 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Table considered:
DESCRIBE TABLE events;
CREATE TABLE events (
userid uuid,
"timestamp" timestamp,
event_type text,
data text,
PRIMARY KEY (userid, "timestamp", event_type)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
Table content:
SELECT * FROM events;
userid | timestamp | event_type | data
--------------------------------------+--------------------------+------------+------
aaaaaaaa-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:06:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:06:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:07:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:08:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:09:17+0100 | toto | null
4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:10:17+0100 | toto | null
(6 rows)
Request1: Request without DISTINCT
SELECT userid FROM events WHERE timestamp > '1970-01-17 09:07:17+0100' ALLOW FILTERING;
userid
--------------------------------------
4271a78f-be1c-44ab-a0e8-f25cf6064b0e
4271a78f-be1c-44ab-a0e8-f25cf6064b0e
4271a78f-be1c-44ab-a0e8-f25cf6064b0e
(3 rows)
Request2: Same request with DISTINCT
SELECT DISTINCT userid FROM events WHERE timestamp > '1970-01-17 09:07:17+0100' ALLOW FILTERING;
userid
--------------------------------------
aaaaaaaa-be1c-44ab-a0e8-f25cf6064b0e
4271a78f-be1c-44ab-a0e8-f25cf6064b0e
(2 rows)
EDIT 1
here is some context.
This table "events" is subject to a lot of writes, it receives around ~1k insertions per second and I have a batch script that checks those events every 5 minutes.
This batch script has 2 needs:
1- get all userids that have been active in the last 5 minutes (i.e every userid present in the events from the last 5 minutes)
2- get all events related to those userids (not only for the last 5 minutes)
I used to have two different tables to handle this. One table "activeusers" for the first request and the "events" table like I have described here for the second request. My problem with that is just that it requires from my server to write in two different tables when it receives an event. So I tried this using only the events table.
It happens that way because in Cassandra CQL DISTINCT
is designed to return only the partition (row) keys of your table (column family)...which must be unique. Therefore, the WHERE
clause can only operate on partition keys when used with DISTINCT
(which in your case, isn't terribly useful). If you take the DISTINCT
out, WHERE
can then be used to evaluate the clustering (column) keys within each partition key (albeit, with ALLOW FILTERING
).
I feel compelled to mention that ALLOW FILTERING
is not something you should be doing a whole lot of...and definitely not in production. If that query is one you need to run often (querying events for userids
after a certain timestamp
) then I would suggest partitioning your data by event_type
instead:
PRIMARY KEY (event_type, "timestamp", userid)
Then you'll be able to run this query without ALLOW FILTERING
.
SELECT userid FROM events WHERE event_type='toto' AND timestamp > '1970-01-17 09:07:17+0100'
Without knowing anything about your application or use case, that may or may not be useful to you. But consider it as an example, and as an indication that there may be a better way build your model to satisfy your query pattern(s). Check out Patrick McFadin's article on timeseries data modeling for more ideas on how to model for this problem.
As explained by Aaron, when using the DISTINCT keyword, you can only filter by partition keys. The reason behind this is the algorithm behind DISTINCT queries and the way Cassandra stores the data into disk/memory.
To understand this, I'll make an analogy:
Cassandra stores the information similar to a book index. If you are searching a chapter called "My third chapter" you only have to look at the first level of the index for it, so you only need to do an iterative search in a relatively small set. However, if you are looking for a sub-chapter called "My fourth sub-chapter" belonging to "My second chapter" you will have to do 2 iterative searchs in 2 different sets, both small, provided that the index has at least 2 levels. The deeper you need to go the longer it may take (you still may be lucky and find it very fast if it is at the start of the index but in this kind of algorithms you have to test for the mean and the worst case scenario) and the more complex the index will need to be.
Cassandra does something similar: Keyspace -> Table -> Partition Key -> Clustering Key -> Column The deeper you need to go, more sets you need to have in memory and it will take longer to find anything. The index used to execute DISTINCT queries may even just contain sets until the partition key level, thus only allowing to search for partition keys.
You need to realise that searching any chapter that has a sub-chapter calles "My second sub-chapter" (what would be the analogy to your query) still requires 2 level deep index and 2 level iterative searchs.
If they decide to support DISTINCT use on clustering keys, then your query would be fine. Meanwhile you will have to filter them in the aplication, probably by using a built-in type called set or something similar that handles the repeated values by itself.
Nor the solution proposed by Aaron (using the userid as a clustering key after the timestamp) neither this one (filtering in the client-side) uses the DISTINCT fast mechanism. His proposal doesn't require client-side filtering as it already handles that for you but offers two main drawbacks: it doesn't offer backwards compatibility as you will have to recreate the table and uses a constant partition key and thus doesn't allow Cassandra to distribute this data among its nodes. Remember that every value of the same partition key is stored in the same node.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With