
Aggregation queries in Cassandra CQL

Tags: cassandra, cql

We are currently evaluating Cassandra as the data store for an analytical application. The plan is to dump raw data into Cassandra and then run mainly aggregation queries over it. From what I can see, CQL does not support some traditional SQL features, such as:

  • Typical aggregate functions such as average, sum, count distinct, etc.
  • GROUP BY and HAVING clauses

I did not find anything in the documentation that helps achieve the above. I also checked whether there are any hooks for providing such functions as extensions, like in-database map-reduce in MongoDB or user-defined functions in relational databases.

People do talk about the paid DataStax Enterprise edition, but even that achieves this not through plain Cassandra but through separate components such as Hadoop, Hive, Pig, etc. There are also suggestions to do the needed pre-aggregation before dumping data into the DB, since Cassandra writes are fast.

This looks like too much overhead, at least for the basic stuff we need. Am I missing something fundamental here?

Would highly appreciate help on this.

samantp asked May 08 '14 03:05


People also ask

How do you aggregate in Cassandra?

Create a function that divides the total value for the selected column by the number of records. Then create the user-defined aggregate to calculate the average value in the column:

CREATE AGGREGATE cycling.average(int) SFUNC avgState STYPE tuple<int,bigint> FINALFUNC avgFinal INITCOND (0,0);
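To make the roles of SFUNC, STYPE, FINALFUNC and INITCOND in the statement above more concrete, here is a small Python model of how such an average aggregate could be evaluated. It only illustrates the fold semantics and is not Cassandra code. The names avg_state and avg_final mirror avgState and avgFinal, but their bodies are assumptions, since the statement does not show them.

```python
from functools import reduce
from typing import Iterable, Optional, Tuple

# (row count, running sum); mirrors STYPE tuple<int,bigint>
State = Tuple[int, int]

# Starting state; mirrors INITCOND (0,0)
INITCOND: State = (0, 0)


def avg_state(state: State, value: Optional[int]) -> State:
    """State function (SFUNC role): called once per row.

    Assumption: null values are skipped rather than counted.
    """
    if value is None:
        return state
    count, total = state
    return (count + 1, total + value)


def avg_final(state: State) -> Optional[float]:
    """Final function (FINALFUNC role): turn the accumulated state into a result."""
    count, total = state
    return total / count if count else None


def average(values: Iterable[Optional[int]]) -> Optional[float]:
    """Fold the state function over all values, then apply the final function."""
    return avg_final(reduce(avg_state, values, INITCOND))


print(average([10, 20, None, 30]))  # 20.0
```

The state carries both a count and a sum because an average cannot be updated row by row from the previous average alone.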

Can we use aggregate function in Cassandra?

Aggregation is available in Cassandra as of version 2.2, as part of CASSANDRA-4914.

What are aggregation queries?

An aggregate query is a method of deriving group and subgroup data by analysis of a set of individual data entries. The term is frequently used by database developers and database administrators.

How do I get unique values in Cassandra?

Use the DISTINCT keyword to return only distinct (different) values of partition keys. The FROM clause specifies the table to query. You may want to precede the table name with the name of the keyspace followed by a period (.). If you do not specify a keyspace, Cassandra queries the current keyspace.


2 Answers

Aggregation was added to Cassandra in CASSANDRA-4914, which is available in the 2.2.0-rc1 release.

mikea answered Oct 02 '22 12:10


In one particular application we use Cassandra for its write speed and then have the app compact the data down to a more compressed, slightly aggregated summary form. We then run an hourly job to copy the summary form to a Postgres table. This approach doesn't score highly for elegance, but it's simple, and it means we can run ad-hoc analytic queries without complicating the primary data ingress path or building bespoke aggregation into the CQL app.
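As a rough illustration of the "compact to a summary form" step described above, here is a minimal Python sketch that rolls raw events up into hourly summaries. The RawEvent and HourlySummary shapes, the sensor_id field, and the choice of hourly buckets are all hypothetical; the real schema and the copy into Postgres are not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class RawEvent:
    sensor_id: str       # hypothetical grouping dimension
    timestamp: datetime
    value: float


@dataclass
class HourlySummary:
    sensor_id: str
    hour: datetime       # start of the hour the bucket covers
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")


def summarize(events: Iterable[RawEvent]) -> List[HourlySummary]:
    """Group events by (sensor_id, hour) and accumulate count, sum, min and max."""
    buckets: Dict[Tuple[str, datetime], HourlySummary] = {}
    for event in events:
        # Truncate the timestamp to the start of its hour.
        hour = event.timestamp.replace(minute=0, second=0, microsecond=0)
        key = (event.sensor_id, hour)
        summary = buckets.get(key)
        if summary is None:
            summary = buckets[key] = HourlySummary(event.sensor_id, hour)
        summary.count += 1
        summary.total += event.value
        summary.minimum = min(summary.minimum, event.value)
        summary.maximum = max(summary.maximum, event.value)
    return sorted(buckets.values(), key=lambda s: (s.sensor_id, s.hour))


utc = timezone.utc
events = [
    RawEvent("a", datetime(2014, 5, 8, 3, 5, tzinfo=utc), 1.0),
    RawEvent("a", datetime(2014, 5, 8, 3, 40, tzinfo=utc), 3.0),
    RawEvent("a", datetime(2014, 5, 8, 4, 1, tzinfo=utc), 5.0),
]
summaries = summarize(events)
for s in summaries:
    print(s.sensor_id, s.hour.isoformat(), s.count, s.total / s.count)
```

Storing the sum and count rather than a precomputed average keeps the summaries re-aggregatable, so daily or weekly figures can later be derived from the hourly rows with ordinary SQL.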

0x6e6562 answered Oct 02 '22 11:10