Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select 2000 most recent log entries in cassandra table using CQL (Latest version)

Tags:

cassandra

cql

How do you query and filter by timeuuid, ie assuming you have a table with

create table mystuff(uuid timeuuid primary key, stuff text);

ie how do you do:

select uuid, unixTimestampOf(uuid), stuff
from mystuff
order by uuid desc
limit 2000

I also want to be able to fetch the next older 2000 and so on, but thats a different problem. The error is:

Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.

and just in case it matters, the real table is actually this:

CREATE TABLE audit_event (
  uuid timeuuid PRIMARY KEY,
  event_time bigint,
  ip text,
  level text,
  message text,
  person_uuid timeuuid
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};
like image 213
Jay Avatar asked Aug 16 '13 13:08

Jay


People also ask

How do I get the last record in Cassandra?

1 Answer. Show activity on this post. SELECT * FROM option_data WHERE ts=1 ORDER BY id DESC LIMIT N; ORDER BY clause can only be applied to the second column in a compound primary key.

How do you pick a count in Cassandra?

Cassandra provides standard built-in functions that return aggregate values to SELECT statements. A SELECT expression using COUNT(column_name) returns the number of non-NULL values in a column. A SELECT expression using COUNT(*) returns the number of rows that matched the query. Use COUNT(1) to get the same result.

What is the use of allow filtering option in select query?

If your table contains for example a 1 million rows and 95% of them have the requested value, the query will still be relatively efficient and you should use ALLOW FILTERING. On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value, your query is extremely inefficient.

How do you use distinct in Cassandra?

In cassandra you can only select the distinct records from Partition Key column or columns. If Partition key consists of multiple columns, you have to provide all of the columns otherwise you will get an error.


1 Answers

I would recommend that you design your table a bit differently. It would be rather hard to achieve what you're asking for with the design you have currently.

At the moment each of your entries in the audit_event table will receive another uuid, internally Cassandra will create many short rows. Querying for such rows is inefficient, and additionally they are ordered randomly (unless using Byte Ordered Partitioner, which you should avoid for good reasons).

However Cassandra is pretty good at sorting columns. If (back to your example) you declared your table like this :

CREATE TABLE mystuff(
  yymmddhh varchar, 
  created timeuuid,  
  stuff text, 
  PRIMARY KEY(yymmddhh, created)
);

Cassandra internally would create a row, where the key would be the hour of a day, column names would be the actual created timestamp and data would be the stuff. That would make it efficient to query.

Consider you have following data (to make it easier I won't go to 2k records, but the idea is the same):

insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '90');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '91');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '92');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '93');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081615', now(), '94');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '95');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '96');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '97');
insert into mystuff(yymmddhh, created, stuff) VALUES ('13081616', now(), '98');

Now lets say that we want to select last two entries (let's a assume for the moment that we know that the "latest" row key to be '13081616'), you can do it by executing query like this:

SELECT * FROM mystuff WHERE yymmddhh = '13081616' ORDER BY created DESC LIMIT 2 ;

which should give you something like this:

 yymmddhh | created                              | stuff
----------+--------------------------------------+-------
 13081616 | 547fe280-067e-11e3-8751-97db6b0653ce |    98
 13081616 | 547f4640-067e-11e3-8751-97db6b0653ce |    97

to get next 2 rows you have to take the last value from the created column and use it for the next query:

SELECT * FROM mystuff WHERE  yymmddhh = '13081616' 
AND created < 547f4640-067e-11e3-8751-97db6b0653ce 
ORDER BY created DESC LIMIT 2 ;

If you received less rows than expected you should change your row key to another hour.

Row key handling / calculation

For now I've assumed that we know the row key with which we want to query the data. If you log a lot of information I'd say that's not the problem - you can take just current time and issue a query with the hour set to what hour we have now. If we run out of rows we can subtract one hour and issue another query.

However if you don't know where your data lies, or if it's not distributed evenly, you can create metadata table, where you'd store the information about the row keys:

CREATE TABLE mystuff_metadata(
  yyyy varchar, 
  yymmddhh varchar, 
  PRIMARY KEY(yyyy, yymmddhh)
) WITH COMPACT STORAGE;

The row keys would be organized by a year, so to get the latest row key from the current year you'd have to issue a query:

SELECT yymmddhh 
FROM  mystuff_metadata where yyyy = '2013' 
ORDER BY yymmddhh DESC LIMIT 1;

Your audit software would have to make an entry to that table on start and later on each hour change (for example before inserting data to mystuff).

like image 148
lpiepiora Avatar answered Dec 01 '22 17:12

lpiepiora