I have data that looks like:
select * from test;
department | employee | batch_number | hash
------------+----------+--------------+-------
dep1 | Bart | 1 | hash1
dep1 | Bart | 1 | hash2
dep1 | Lisa | 3 | hash3
dep1 | Lisa | 4 | hash4
dep1 | John | 5 | hash5
dep1 | Lucy | 6 | hash6
dep1 | Bart | 7 | hash7
dep1 | Bart | 7 | hash8
And I would like to query the data with a where
clause on batch_number
, an ordering
on batch_number
and an in
predicate on employee
.
In a relational database this would look like
select * from test
where department='dep1'
and employee in ('Bart','Lucy','John')
and batch_number >= 2
order by batch_number desc
limit 3;
department | employee | batch_number | hash
------------+----------+--------------+-------
dep1 | Bart | 7 | hash7
dep1 | Bart | 7 | hash8
dep1 | Lucy | 6 | hash6
I'm having some issues modelling a table for this query in Cassandra. department
will be my partition key and hash
needs to be part of the primary key. But I'm struggling with cluster keys and/or ((SSTable attached) secondary) indexes.
Because I want ordering on batch_number
I tried including it as a cluster key:
CREATE TABLE keyspace.test(
department TEXT,
batch_number INT,
hash TEXT,
employee TEXT,
PRIMARY KEY ((department), batch_number, hash)
) WITH CLUSTERING ORDER BY (batch_number DESC);
CREATE INDEX tst_emp ON keyspace.test (employee);
But this does not allow for queries with an in
predicate on my index:
select * from keyspace.test where department='dep1' and employee in ('Bart','Lucy','John');
InvalidRequest: Error from server: code=2200 [Invalid query] message="IN predicates on non-primary-key columns (employee) is not yet supported"
So I tried adding the employee
column as cluster key as well:
CREATE TABLE keyspace.test(
department TEXT,
batch_number INT,
hash TEXT,
employee TEXT,
PRIMARY KEY ((department), batch_number, hash, employee)
) WITH CLUSTERING ORDER BY (batch_number DESC);
But this fails because I can not put a non-EQ relation on batch_number
:
select * from keyspace.test where department='dep1' and batch_number > 1 and employee in ('Bart','Lucy','John');
InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "employee" cannot be restricted (preceding column "batch_number" is restricted by a non-EQ relation)"
But whenever I put employee
before batch_number
I lose the ability to order on batch_number
:
CREATE TABLE keyspace.test(
department TEXT,
employee TEXT,
batch_number INT,
hash TEXT,
PRIMARY KEY ((department), employee, batch_number, hash)
);
select * from keyspace.test where department='dep1' and employee in ('Bart','Lucy','John') ORDER BY batch_number DESC;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY"
So what table design would allow for such a query? Can this be done in Cassandra?
Edit:
Other queries I would like to be able to run on this table are:
select * from keyspace.test where department='X' and batch_number=Y
and
delete from keyspace.test where department='X'
Using a materialized view that allows you to rearrange your data:
CREATE MATERIALIZED VIEW mv_test AS
SELECT
department,
batch_number,
employee,
hash
FROM
test
WHERE
department IS NOT NULL
AND batch_number IS NOT NULL
AND employee IS NOT NULL
AND hash IS NOT NULL
PRIMARY KEY (department, employee, batch_number, hash)
WITH clustering
ORDER BY
(batch_number DESC);
I can perform the following query:
SELECT * FROM mv_test
WHERE
department = 'dep1'
AND employee IN
(
'Bart',
'Lisa'
)
AND batch_number > 3;
The results are ordered due to the clustering order:
department | employee | batch_number | hash
------------+----------+--------------+-------
dep1 | Bart | 7 | hash7
dep1 | Bart | 7 | hash8
dep1 | Lisa | 4 | hash4
While >
clauses are non-equal clauses, IN
, while having multiple values, is still deterministic, which is why I believe you can filter the keys without issue.
Since batch_number
is the last thing you want to filter, any type of where clause is allowed. I am assuming that you always have department
.
Note that materialized views impact performance. More specifically, write performance. However it is beneficial to read performance as opposed to ALLOW FILTERING
.
UPDATE:
The ordering specified at the end of the materialized view says batch_number
, however, it will first order on department
, then employee
, and then batch_number
, so the order of the batch_number
specifically is not guaranteed. As far as I know there is no way around this. Another database solution may be preferable.
UPDATE 2:
As stated in an Apache mailing chain (see comments below), materialized views are not deemed production-ready. However, Datastax considers them usable, provided they are taken care of using the mentioned best practices. Personally, I have not had any trouble with materialized views. This is of course, with a simple single datacenter cluster and considering the best practices mention more complicated setups, they could break in such circumstances.
You can use an index on the employee
and even remove it from the primary key if you want. You would need to drop the use of IN
but you could split the query and join the results on the client side.
CREATE TABLE tk.test_good(
department TEXT,
batch_number INT,
employee TEXT,
hash TEXT,
PRIMARY KEY ((department), batch_number, hash)
)WITH CLUSTERING ORDER BY (batch_number DESC);
CREATE INDEX IF NOT EXISTS employee_idx ON tk.test_good ( employee );
select * from tk.test_good where department='dep1' and employee='Bart' and batch_number >= 2 limit 3;
select * from tk.test_good where department='dep1' and employee='Lucy' and batch_number >= 2 limit 3;
select * from tk.test_good where department='dep1' and employee='John' and batch_number >= 2 limit 3;
The downside of this approach is that your indexes might get too big. But I don't know the data pool size, so I'll leave that at your consideration.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With