 

Hadoop on Cassandra database

I am using Cassandra to store my data and Hive to process it. I have 5 machines running Cassandra and 2 machines that I use as analytics nodes (where Hive runs). My question: does Hive run MapReduce only on the two analytics nodes and pull the data there, or does it also move the computation to the 5 Cassandra nodes and process the data on those machines? (As far as I know, in Hadoop the process moves to the data, not the data to the process.)

Aashish Katta asked Feb 12 '13 07:02





1 Answer

If you are interested in marrying Hadoop and Cassandra, the first place to look is DataStax, a company built around this concept: http://www.datastax.com/ They build and support Hadoop with HDFS replaced by Cassandra. To the best of my understanding, they do have data locality: http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/

There is a good answer about Hadoop and Cassandra data locality when you run MapReduce against Cassandra: "Cassandra and MapReduce - minimal setup requirements".

Regarding your question, there is a tradeoff:
a) If you run Hadoop/Hive on separate nodes, you lose data locality, and therefore your data throughput is limited by your network bandwidth.
b) If you run Hadoop/Hive on the same nodes as Cassandra, you get data locality, but the MapReduce processing behind Hive queries might clog your network (and other resources) and thereby affect the quality of service you get from Cassandra.
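
To get a rough feel for option a) versus option b), here is a back-of-envelope sketch in Python. Every number in it is a hypothetical assumption rather than a measurement (a 1 TB data set, a roughly 1 Gbit/s network link per analytics node, about 0.2 GB/s of local sequential read per Cassandra node), and the helper `scan_seconds` is made up for illustration. It only estimates a lower bound on the time to read the data once and ignores CPU, replication, and contention with live Cassandra traffic.

```python
def scan_seconds(data_gb: float, workers: int, gb_per_sec_per_worker: float) -> float:
    """Lower bound on the time to read data_gb once, split evenly across workers."""
    return data_gb / (workers * gb_per_sec_per_worker)


# Hypothetical figures, chosen only to illustrate the tradeoff.
DATA_GB = 1000.0            # total data to scan
NETWORK_GB_PER_SEC = 0.125  # about 1 Gbit/s of network per analytics node
DISK_GB_PER_SEC = 0.2       # assumed local read rate per Cassandra node

# Option a): 2 separate analytics nodes pull everything over the network.
remote = scan_seconds(DATA_GB, workers=2, gb_per_sec_per_worker=NETWORK_GB_PER_SEC)

# Option b): tasks run on the 5 Cassandra nodes and read from local disk.
local = scan_seconds(DATA_GB, workers=5, gb_per_sec_per_worker=DISK_GB_PER_SEC)

print(f"remote scan lower bound: {remote:.0f} s")
print(f"local scan lower bound:  {local:.0f} s")
```

Under these assumed figures the network-bound scan takes about four times as long as the local one. Plug in your own hardware numbers before drawing any conclusion, and remember that option b) pays for its speed with extra load on the Cassandra nodes.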

My suggestion is to use separate Hive nodes if the performance of your Cassandra cluster is critical.
If your Cassandra cluster is mostly used as a data store and does not handle real-time requests, then running Hive on each node will improve performance and hardware utilization.

David Gruzman answered Sep 23 '22 21:09
