Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cassandra table design for query with ORDER, LIMIT and IN predicate

Tags:

I have data that looks like:

select * from test;

 department | employee | batch_number | hash
------------+----------+--------------+-------
 dep1       | Bart     |            1 | hash1
 dep1       | Bart     |            1 | hash2
 dep1       | Lisa     |            3 | hash3
 dep1       | Lisa     |            4 | hash4
 dep1       | John     |            5 | hash5
 dep1       | Lucy     |            6 | hash6
 dep1       | Bart     |            7 | hash7
 dep1       | Bart     |            7 | hash8

And I would like to query the data with a where clause on batch_number, an ordering on batch_number and an in predicate on employee.

In a relational database this would look like

select * from test 
  where department='dep1' 
  and employee in ('Bart','Lucy','John') 
  and batch_number >= 2 
  order by batch_number desc 
  limit 3;

 department | employee | batch_number | hash
------------+----------+--------------+-------
 dep1       | Bart     |            7 | hash7
 dep1       | Bart     |            7 | hash8
 dep1       | Lucy     |            6 | hash6

I'm having some issues modelling a table for this query in Cassandra. department will be my partition key and hash needs to be part of the primary key. But I'm struggling with cluster keys and/or ((SSTable attached) secondary) indexes.

Because I want ordering on batch_number I tried including it as a cluster key:

CREATE TABLE keyspace.test(
    department      TEXT,
    batch_number    INT,
    hash            TEXT,
    employee        TEXT,
    PRIMARY KEY ((department), batch_number, hash)
) WITH CLUSTERING ORDER BY (batch_number DESC);
CREATE INDEX tst_emp ON keyspace.test (employee);

But this does not allow for queries with an in predicate on my index:

select * from keyspace.test where department='dep1' and employee in ('Bart','Lucy','John');
InvalidRequest: Error from server: code=2200 [Invalid query] message="IN predicates on non-primary-key columns (employee) is not yet supported"

So I tried adding the employee column as cluster key as well:

CREATE TABLE keyspace.test(
    department      TEXT,
    batch_number    INT,
    hash            TEXT,
    employee        TEXT,
    PRIMARY KEY ((department), batch_number, hash, employee)
) WITH CLUSTERING ORDER BY (batch_number DESC);

But this fails because I can not put a non-EQ relation on batch_number:

select * from keyspace.test where department='dep1' and batch_number > 1 and employee in ('Bart','Lucy','John');
InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "employee" cannot be restricted (preceding column "batch_number" is restricted by a non-EQ relation)"

But whenever I put employee before batch_number I lose the ability to order on batch_number:

CREATE TABLE keyspace.test(
    department      TEXT,
    employee        TEXT,
    batch_number    INT,
    hash            TEXT,
    PRIMARY KEY ((department), employee, batch_number, hash)
);

select * from keyspace.test where department='dep1' and employee in ('Bart','Lucy','John') ORDER BY batch_number DESC;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY"

So what table design would allow for such a query? Can this be done in Cassandra?

Edit:

Other queries I would like to be able to run on this table are:

select * from keyspace.test where department='X' and batch_number=Y 

and

delete from keyspace.test where department='X'
like image 306
Maarten Dhondt Avatar asked Nov 09 '18 10:11

Maarten Dhondt


2 Answers

Using a materialized view that allows you to rearrange your data:

CREATE MATERIALIZED VIEW mv_test AS 
SELECT
   department,
   batch_number,
   employee,
   hash 
FROM
   test 
WHERE
   department IS NOT NULL 
   AND batch_number IS NOT NULL 
   AND employee IS NOT NULL 
   AND hash IS NOT NULL 
PRIMARY KEY (department, employee, batch_number, hash) 
WITH clustering 
ORDER BY
(batch_number DESC);

I can perform the following query:

SELECT * FROM mv_test 
WHERE
   department = 'dep1' 
   AND employee IN 
   (
      'Bart',
      'Lisa'
   )
   AND batch_number > 3;

The results are ordered due to the clustering order:

 department | employee | batch_number | hash
------------+----------+--------------+-------
       dep1 |     Bart |            7 | hash7
       dep1 |     Bart |            7 | hash8
       dep1 |     Lisa |            4 | hash4

While > clauses are non-equal clauses, IN, while having multiple values, is still deterministic, which is why I believe you can filter the keys without issue. Since batch_number is the last thing you want to filter, any type of where clause is allowed. I am assuming that you always have department.

Note that materialized views impact performance. More specifically, write performance. However it is beneficial to read performance as opposed to ALLOW FILTERING.

UPDATE:

The ordering specified at the end of the materialized view says batch_number, however, it will first order on department, then employee, and then batch_number, so the order of the batch_number specifically is not guaranteed. As far as I know there is no way around this. Another database solution may be preferable.

UPDATE 2:

As stated in an Apache mailing chain (see comments below), materialized views are not deemed production-ready. However, Datastax considers them usable, provided they are taken care of using the mentioned best practices. Personally, I have not had any trouble with materialized views. This is of course, with a simple single datacenter cluster and considering the best practices mention more complicated setups, they could break in such circumstances.

like image 123
Limnic Avatar answered Oct 14 '22 09:10

Limnic


You can use an index on the employee and even remove it from the primary key if you want. You would need to drop the use of IN but you could split the query and join the results on the client side.

CREATE TABLE tk.test_good(
    department      TEXT,
    batch_number    INT,
    employee        TEXT,
    hash            TEXT,
    PRIMARY KEY ((department), batch_number, hash)
)WITH CLUSTERING ORDER BY (batch_number DESC);

CREATE INDEX IF NOT EXISTS employee_idx ON tk.test_good ( employee );

select * from tk.test_good where department='dep1' and employee='Bart' and batch_number >= 2 limit 3;
select * from tk.test_good where department='dep1' and employee='Lucy' and batch_number >= 2 limit 3;
select * from tk.test_good where department='dep1' and employee='John' and batch_number >= 2 limit 3;

The downside of this approach is that your indexes might get too big. But I don't know the data pool size, so I'll leave that at your consideration.

like image 32
Pedro Gordo Avatar answered Oct 14 '22 08:10

Pedro Gordo