We have a lot of user-interaction data from various websites stored in Cassandra, such as cookies, page visits, ads viewed, ads clicked, etc., that we would like to report on. Our current Cassandra schema supports basic reporting and querying. However, we would also like to build large queries that would typically involve joins on large column families (containing millions of rows).
What approach is best suited for this? One possibility is to extract the data out to a relational database such as MySQL and do the data mining there. An alternative could be to use Hadoop with Hive or Pig to run MapReduce queries for this purpose. I must admit I have zero experience with the latter.
Does anyone have experience with the performance differences of one versus the other? Would you run MapReduce queries against a live Cassandra production instance, or against a backup copy, to prevent the query load from affecting write performance?
Cassandra is great for storing and querying large amounts of data at high speed, which is why it is often used for IoT and real-time analytics use cases. You want your analytics platform to leverage and build on the strengths of your Cassandra implementation, and with Knowi that is precisely what you get.
Cassandra handles this splendidly, given its high speed and its support for real-time analytics through seamless integration with Apache Spark.
Data in Cassandra is stored as a set of rows organized into tables (also called column families). Each row is identified by a primary key value, and data is distributed across nodes by the partition key, the first component of the primary key.
Apache Cassandra is a highly reliable database system suited to real-time workloads. It is a highly available, scalable distributed database with a flexible schema that can manage large data sets across multiple data centers.
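To make the data model described above concrete, here is a minimal sketch using the DataStax Python driver (cassandra-driver); the keyspace, table, and column names are made up for illustration:

```python
# Minimal sketch of a Cassandra table for page-visit data, assuming a
# locally reachable cluster. Names (web_analytics, page_visits, ...) are
# hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # contact point(s) of your cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS web_analytics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# A table (column family): the primary key identifies each row, and its
# first component (user_id) is the partition key, so all visits for one
# user are stored together on the same partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS web_analytics.page_visits (
        user_id    text,
        visit_ts   timestamp,
        url        text,
        ad_clicked boolean,
        PRIMARY KEY ((user_id), visit_ts)
    )
""")
```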
In my experience, Cassandra is better suited to workloads where you need real-time access to your data, fast random reads, and the ability to handle large traffic volumes in general. However, if you start running complex analytics on it, the availability of your Cassandra cluster will probably suffer noticeably; from what I've seen, it's in your best interest to leave the production cluster alone for that kind of work.
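To illustrate the difference, continuing with the hypothetical table above: a read by partition key is routed to a single partition and is the kind of access Cassandra excels at, whereas an ad-hoc filter on a non-key column forces a scan of the whole table (CQL even makes you spell out ALLOW FILTERING), which is exactly the analytical load that hurts a live cluster:

```python
# Point read vs. analytical scan against the hypothetical table above.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("web_analytics")

# Fast: routed straight to the partition that owns this user_id.
visits = session.execute(
    "SELECT visit_ts, url FROM page_visits WHERE user_id = %s", ("user-42",))

# Anti-pattern: filtering on a non-key column scans every partition on
# every node, competing with production reads and writes.
clicks = session.execute(
    "SELECT user_id, url FROM page_visits "
    "WHERE ad_clicked = true ALLOW FILTERING")
```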
It sounds like you need an analytics platform, and I would definitely advise exporting your reporting data out of Cassandra into an offline data-warehouse system (a sketch of one possible export path follows below).
If you can afford it, a real data warehouse would allow you to run complex queries with complex joins across multiple tables. These data-warehouse systems are widely used for reporting; the established commercial offerings (such as Netezza, mentioned below) are, in my opinion, the key players.
A recent one that is gaining a lot of momentum is Amazon Redshift. It is currently in beta, but if you can get your hands on it, give it a try: it looks like a solid analytics platform with pricing that is much more attractive than the established commercial solutions.
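As a sketch of what the export step could look like, here is one way to pull the reporting data out of Cassandra with Spark (mentioned earlier) and stage it for a warehouse; it assumes the DataStax spark-cassandra-connector is available on the Spark classpath and reuses the hypothetical schema from above:

```python
# PySpark job that reads a Cassandra table and stages it as Parquet for an
# offline warehouse. The connector package must be supplied (e.g. via
# spark-submit --packages); keyspace/table names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-export")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

visits = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="web_analytics", table="page_visits")
          .load())

# Stage as Parquet (or load straight into the warehouse over JDBC) so the
# heavy joins run off-cluster instead of against production Cassandra.
visits.write.mode("overwrite").parquet("/warehouse/staging/page_visits")
```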
Alternatives like Hadoop MapReduce/Hive/Pig are also interesting to look at, though they are probably not a full replacement for a real data warehouse. I would recommend Hive if you have a SQL background, because it will be very easy to understand what you're doing and it scales well. There are also libraries built on Hadoop, like Apache Mahout, that let you do data mining on a Hadoop cluster; you should definitely give that a try and see if it fits your needs.
To give you an idea, an approach I've used that has been working well so far is to pre-aggregate the results in Hive and then generate the reports themselves in a data warehouse like Netezza, which computes the complex joins.
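As a rough illustration of that pre-aggregation step, here is a sketch that runs a HiveQL roll-up from Python via PyHive; the host, table, and column names are hypothetical, and the resulting daily summary is what would then be loaded into the warehouse for the heavier joins:

```python
# Pre-aggregate raw page-visit data into a daily roll-up in Hive.
# Hostname, table, and column names are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS daily_ad_clicks AS
    SELECT  to_date(visit_ts)       AS day,
            url,
            COUNT(*)                AS clicks,
            COUNT(DISTINCT user_id) AS unique_users
    FROM    page_visits
    WHERE   ad_clicked = true
    GROUP BY to_date(visit_ts), url
""")
```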