
Paging Resultsets in Cassandra with compound primary keys - Missing out on rows

Tags: cassandra, cql3

So, my original problem was using the token() function to page through a large data set in Cassandra 1.2.9, as explained and answered here: Paging large resultsets in Cassandra with CQL3 with varchar keys

The accepted answer got the select working with tokens and chunk size, but another problem manifested itself.

My table looks like this in cqlsh:

key           | column1               | value
---------------+-----------------------+-------
  85.166.4.140 |       county_finnmark |     4
  85.166.4.140 |       county_id_20020 |     4
  85.166.4.140 |     municipality_alta |     2
  85.166.4.140 | municipality_id_20441 |     2
 93.89.124.241 |        county_hedmark |    24
 93.89.124.241 |       county_id_20005 |    24

The primary key is a composite of key and column1. In CLI, the same data looks like this:

get ip['85.166.4.140'];
=> (counter=county_finnmark, value=4)
=> (counter=county_id_20020, value=4)
=> (counter=municipality_alta, value=2)
=> (counter=municipality_id_20441, value=2)
Returned 4 results.

The problem

When using CQL with a limit of, e.g., 100, the returned results may stop in the middle of a record (that is, partway through the CQL rows that share one key), like this:

key           | column1               | value
---------------+-----------------------+-------
  85.166.4.140 |       county_finnmark |     4
  85.166.4.140 |       county_id_20020 |     4

leaving these two "rows" (columns) out:

  85.166.4.140 |     municipality_alta |     2
  85.166.4.140 | municipality_id_20441 |     2

Now, when I use the token() function to fetch the next page like this, these two rows are skipped:

select * from ip where token(key) > token('85.166.4.140') limit 10;

Result:

key           | column1                | value
---------------+------------------------+-------
 93.89.124.241 |         county_hedmark |    24
 93.89.124.241 |        county_id_20005 |    24
 95.169.53.204 |        county_id_20006 |     2
 95.169.53.204 |         county_oppland |     2

So, no trace of the last two results from the previous IP address.
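To make the skip concrete, here is a minimal Python sketch that mimics the two queries over an in-memory copy of the rows shown above. It does not talk to Cassandra: the TOKENS values are made up (only their relative order matters), and select() is a hypothetical stand-in for running the CQL.

```python
# Made-up token values standing in for the partitioner's hash of each key.
# Only their relative order matters for this illustration.
TOKENS = {"85.166.4.140": 10, "93.89.124.241": 20, "95.169.53.204": 30}

# In-memory copy of the rows from the question: (key, column1, value).
ROWS = [
    ("85.166.4.140", "county_finnmark", 4),
    ("85.166.4.140", "county_id_20020", 4),
    ("85.166.4.140", "municipality_alta", 2),
    ("85.166.4.140", "municipality_id_20441", 2),
    ("93.89.124.241", "county_hedmark", 24),
    ("93.89.124.241", "county_id_20005", 24),
    ("95.169.53.204", "county_id_20006", 2),
    ("95.169.53.204", "county_oppland", 2),
]


def token(key):
    """Hypothetical stand-in for CQL's token(key)."""
    return TOKENS[key]


def select(where=lambda row: True, limit=None):
    """Mimic `select * from ip where ... limit N`: rows come back
    ordered by token(key), then by column1 within each key."""
    ordered = sorted(ROWS, key=lambda r: (token(r[0]), r[1]))
    matched = [r for r in ordered if where(r)]
    return matched if limit is None else matched[:limit]


# Page 1 stops in the middle of 85.166.4.140 (limit 2 here instead of 100).
page1 = select(limit=2)
last_key = page1[-1][0]

# Page 2: where token(key) > token(last_key) limit 10
page2 = select(where=lambda r: token(r[0]) > token(last_key), limit=10)

# Rows that neither page returned.
seen = set(page1) | set(page2)
skipped = [r for r in ROWS if r not in seen]
print(skipped)
```

The skipped list holds the two municipality rows for 85.166.4.140, because the second query excludes every row whose key has the same token as the last key of the first page.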

Question

How can I use token() for paging without skipping over cql rows? Something like:

select * from ip where token(key) > token(key:column1) limit 10;
Marius Waldal, asked May 13 '14 07:05



1 Answer

Ok, so I used the info in this post to work out a solution: http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive (section "CQL3 pagination").

First, I execute this cql:

select * from ip limit 5000;

From the last row in the resultset, I get the key (e.g. '85.166.4.140') and the value from column1 (e.g. 'county_id_20020').

Then I create a prepared statement evaluating to

select * from ip where token(key) = token('85.166.4.140') and column1 > 'county_id_20020' ALLOW FILTERING;

(I'm guessing it would work also without using the token() function, as the check is now for equal:)

select * from ip where key = '85.166.4.140' and column1 > 'county_id_20020' ALLOW FILTERING;

The resultset now contains the remaining X rows (columns) for this IP. My paging method then returns all the rows from both queries, and the next call to the method is passed the last used key ('85.166.4.140'). With this key, I can execute the following select:

select * from ip where token(key) > token('85.166.4.140') limit 5000;

which gives me the next 5000 rows from (and including) the first IP after '85.166.4.140'.

Now, no columns are lost in the paging.
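Put together, the scheme looks roughly like the Python sketch below. This is only an in-memory simulation under stated assumptions, not driver code: TOKENS holds made-up token values, select() is a hypothetical stand-in for executing the CQL, and paged_rows() is a name I picked for illustration.

```python
# Made-up token values standing in for the partitioner's hash of each key.
TOKENS = {"85.166.4.140": 10, "93.89.124.241": 20, "95.169.53.204": 30}

# In-memory copy of the example rows: (key, column1, value).
ROWS = [
    ("85.166.4.140", "county_finnmark", 4),
    ("85.166.4.140", "county_id_20020", 4),
    ("85.166.4.140", "municipality_alta", 2),
    ("85.166.4.140", "municipality_id_20441", 2),
    ("93.89.124.241", "county_hedmark", 24),
    ("93.89.124.241", "county_id_20005", 24),
    ("95.169.53.204", "county_id_20006", 2),
    ("95.169.53.204", "county_oppland", 2),
]


def token(key):
    """Hypothetical stand-in for CQL's token(key)."""
    return TOKENS[key]


def select(where, limit=None):
    """Mimic `select * from ip where ... limit N`: rows come back
    ordered by token(key), then by column1 within each key."""
    ordered = sorted(ROWS, key=lambda r: (token(r[0]), r[1]))
    matched = [r for r in ordered if where(r)]
    return matched if limit is None else matched[:limit]


def paged_rows(page_size):
    """Yield every row once, using the two queries from the answer."""
    # select * from ip limit N
    page = select(lambda r: True, page_size)
    while page:
        yield from page
        last_key, last_col, _ = page[-1]
        # Rest of the key the page ended in:
        # where token(key) = token(last_key) and column1 > last_col
        yield from select(
            lambda r: token(r[0]) == token(last_key) and r[1] > last_col
        )
        # Next page starts at the first key after last_key:
        # where token(key) > token(last_key) limit N
        page = select(lambda r: token(r[0]) > token(last_key), page_size)


result = list(paged_rows(page_size=3))
print(len(result), "rows, no gaps:", result == ROWS)
```

Each pass through the loop issues one "remainder" query for the key the page stopped in, even when that key was already complete; in that case the query simply returns nothing.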

UPDATE

Cassandra 2.0 introduced automatic paging, handled by the client. More info here: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0

(note that setFetchSize is optional and not necessary for paging to work)

Marius Waldal, answered Oct 25 '22 05:10