We have a lot of user-interaction data from various websites stored in Cassandra, such as cookies, page visits, ads viewed, ads clicked, etc., that we would like to report on. Our current Cassandra schema supports basic reporting and querying. However, we would also like to build large queries that would typically involve joins on large column families (containing millions of rows).
What approach is best suited for this? One possibility is to extract the data out to a relational database such as MySQL and do the data mining there. An alternative could be to use Hadoop with Hive or Pig to run MapReduce queries for this purpose. I must admit I have zero experience with the latter.
Does anyone have experience with the performance differences of one versus the other? Would you run MapReduce queries against a live Cassandra production instance, or against a backup copy, to prevent the query load from affecting write performance?
Cassandra is great for storing and querying large amounts of data at high speed, which is why it is often used for IoT and real-time analytics use cases. You want your analytics platform to leverage and build on the strengths of your Cassandra implementation, and with Knowi that is precisely what you get.
Cassandra handles this splendidly, given its high speed and its support for real-time analytics through seamless integration with Apache Spark.
Data in Cassandra is stored as a set of rows organized into tables (also called column families). Each row is identified by a primary key value, and data is distributed across nodes by the partition key, the first component of the primary key.
Apache Cassandra is a highly reliable database system suited to real-time workloads. It is a highly available, scalable distributed database with a flexible schema that can manage large data sets across multiple data centers.
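To make the data model described above concrete, here is a minimal sketch using the DataStax Python driver (cassandra-driver); the keyspace, table, and column names are made up for illustration:

```python
# Minimal sketch of a Cassandra table for page-visit data, assuming a
# locally reachable cluster. Names (web_analytics, page_visits, ...) are
# hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # contact point(s) of your cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS web_analytics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# A table (column family): the primary key identifies each row, and its
# first component (user_id) is the partition key, so all visits for one
# user are stored together on the same partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS web_analytics.page_visits (
        user_id    text,
        visit_ts   timestamp,
        url        text,
        ad_clicked boolean,
        PRIMARY KEY ((user_id), visit_ts)
    )
""")
```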
In my experience, Cassandra is better suited to workloads where you need real-time access to your data, fast random reads, and the ability to handle large traffic volumes in general. However, if you start running complex analytics on it, the availability of your Cassandra cluster will probably suffer noticeably; from what I've seen, it's in your best interest to leave the production cluster alone for that kind of work.
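To illustrate the difference, continuing with the hypothetical table above: a read by partition key is routed to a single partition and is the kind of access Cassandra excels at, whereas an ad-hoc filter on a non-key column forces a scan of the whole table (CQL even makes you spell out ALLOW FILTERING), which is exactly the analytical load that hurts a live cluster:

```python
# Point read vs. analytical scan against the hypothetical table above.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("web_analytics")

# Fast: routed straight to the partition that owns this user_id.
visits = session.execute(
    "SELECT visit_ts, url FROM page_visits WHERE user_id = %s", ("user-42",))

# Anti-pattern: filtering on a non-key column scans every partition on
# every node, competing with production reads and writes.
clicks = session.execute(
    "SELECT user_id, url FROM page_visits "
    "WHERE ad_clicked = true ALLOW FILTERING")
```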
It sounds like you need an analytics platform, and I would definitely advise exporting your reporting data out of Cassandra into an offline data-warehouse system (a sketch of one possible export path follows below).
If you can afford it, a real data warehouse would allow you to run complex queries with complex joins across multiple tables. These data-warehouse systems are widely used for reporting; the established commercial offerings (such as Netezza, mentioned below) are, in my opinion, the key players.
A recent one that is gaining a lot of momentum is Amazon Redshift. It is currently in beta, but if you can get your hands on it, give it a try: it looks like a solid analytics platform with pricing that is much more attractive than the established commercial solutions.
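As a sketch of what the export step could look like, here is one way to pull the reporting data out of Cassandra with Spark (mentioned earlier) and stage it for a warehouse; it assumes the DataStax spark-cassandra-connector is available on the Spark classpath and reuses the hypothetical schema from above:

```python
# PySpark job that reads a Cassandra table and stages it as Parquet for an
# offline warehouse. The connector package must be supplied (e.g. via
# spark-submit --packages); keyspace/table names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-export")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

visits = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="web_analytics", table="page_visits")
          .load())

# Stage as Parquet (or load straight into the warehouse over JDBC) so the
# heavy joins run off-cluster instead of against production Cassandra.
visits.write.mode("overwrite").parquet("/warehouse/staging/page_visits")
```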
Alternatives like Hadoop MapReduce/Hive/Pig are also interesting to look at, though they are probably not a full replacement for a real data warehouse. I would recommend Hive if you have a SQL background, because it will be very easy to understand what you're doing and it scales well. There are also libraries built on Hadoop, like Apache Mahout, that let you do data mining on a Hadoop cluster; you should definitely give that a try and see if it fits your needs.
To give you an idea, an approach I've used that has been working well so far is to pre-aggregate the results in Hive and then generate the reports themselves in a data warehouse like Netezza, which computes the complex joins.
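As a rough illustration of that pre-aggregation step, here is a sketch that runs a HiveQL roll-up from Python via PyHive; the host, table, and column names are hypothetical, and the resulting daily summary is what would then be loaded into the warehouse for the heavier joins:

```python
# Pre-aggregate raw page-visit data into a daily roll-up in Hive.
# Hostname, table, and column names are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS daily_ad_clicks AS
    SELECT  to_date(visit_ts)       AS day,
            url,
            COUNT(*)                AS clicks,
            COUNT(DISTINCT user_id) AS unique_users
    FROM    page_visits
    WHERE   ad_clicked = true
    GROUP BY to_date(visit_ts), url
""")
```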