Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to SELECT DISTINCT in cassandra

I want to SELECT DISTINCT in Cassandra, but I think Cassandra doesn't support these operations. How can I do SELECT DISTINCT in Cassandra? Is it possible?

like image 794
Anse danesh Avatar asked Aug 15 '13 04:08

Anse danesh


People also ask

Is primary key unique in Cassandra?

Yes the primary key has to be unique. Otherwise there would be no way to know which row to return when you query with a duplicate key. In your case you can have 2 rows with the same name or with the same surname but not both.

How do I enable filtering in Cassandra?

SELECT * FROM blogs WHERE author='Jonathan Ellis' and time2 = 1418306451235; Cassandra will request ALLOW FILTERING as it will have to first find and load the rows containing Jonathan as author, and then to filter out the ones which do not have a time2 column equal to the specified value.

How do I use group by in Cassandra?

The GROUP BY option can condense all selected rows that share the same values for a set of columns into a single row. Using the GROUP BY option, rows can be grouped at the partition key or clustering column level. Consequently, the GROUP BY option only accepts primary key columns in defined order as arguments.


2 Answers

CQL 3.1.1 and onwards support DISTINCT operation only for partition keys.

SELECT statement now allows listing the partition keys (using the DISTINCT modifier). See CASSANDRA-4536.

Select Syntax

select_statement ::=  SELECT [ JSON | DISTINCT ] ( select_clause | '*' )
                      FROM table_name
                      [ WHERE where_clause ]
                      [ GROUP BY group_by_clause ]
                      [ ORDER BY ordering_clause ]
                      [ PER PARTITION LIMIT (integer | bind_marker) ]
                      [ LIMIT (integer | bind_marker) ]
                      [ ALLOW FILTERING ]
select_clause    ::=  selector [ AS identifier ] ( ',' selector [ AS identifier ] )
selector         ::=  column_name
                      | term
                      | CAST '(' selector AS cql_type ')'
                      | function_name '(' [ selector ( ',' selector )* ] ')'
                      | COUNT '(' '*' ')'
where_clause     ::=  relation ( AND relation )*
relation         ::=  column_name operator term
                      '(' column_name ( ',' column_name )* ')' operator tuple_literal
                      TOKEN '(' column_name ( ',' column_name )* ')' operator term
operator         ::=  '=' | '<' | '>' | '<=' | '>=' | '!=' | IN | CONTAINS | CONTAINS KEY
group_by_clause  ::=  column_name ( ',' column_name )*
ordering_clause  ::=  column_name [ ASC | DESC ] ( ',' column_name [ ASC | DESC ] )*

like image 107
Babar Avatar answered Sep 28 '22 12:09

Babar


As others wrote, there is no support for distinct on columns in Cassandra, only on primary key. Two options to do this with cassandra:

  1. Process in application - requires loading reading entire table from server and doing the distinct in code.
  2. Create a secondary table in which the key is the column you want to be distinct, and double insert there when doing an operation on original table.

The decision between the two depends on your actual data structure/size and needs, if the table size is small or you do this operation very little, option 1 will be enough and fast, if the table is large, and/or you do this query a lot of times go with #2.

like image 31
Moshe Eshel Avatar answered Sep 28 '22 12:09

Moshe Eshel