
Configuring external data source for Elastic MapReduce

We want to run Amazon Elastic MapReduce on top of our current database (we are using Cassandra on EC2). According to the Amazon EMR FAQ, this should be possible; see the entry "Q: Can I load my data from the internet or somewhere other than Amazon S3?"

However, when creating a new job flow, we can only configure an S3 bucket as the input data source.

Any ideas or samples on how to do this?

Thanks!

P.S.: I've seen the question "How to use external data with Elastic MapReduce", but the answers only say that it is possible; they do not explain how to do it or configure it.

Víctor Penela asked Aug 29 '12 12:08


1 Answer

How are you processing the data? EMR is just managed Hadoop, so you still need to write the processing job yourself.

If you are writing a Hadoop MapReduce job, you are writing Java, so you can use the Cassandra APIs to read the data directly from your job.
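To make the idea concrete, here is a minimal sketch in plain Java of a job whose input comes from a pluggable source rather than from S3. Everything in it is an assumption for illustration: `RowSource`, `InMemorySource` and `ExternalSourceSketch` are hypothetical names, and the in-memory list only stands in for Cassandra so the code runs without a cluster. In a real Hadoop job that role would be played by an `InputFormat`; Cassandra's Hadoop integration has shipped one (`ColumnFamilyInputFormat` in `org.apache.cassandra.hadoop`), but check the documentation for the Cassandra version you run before relying on it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ExternalSourceSketch {

    /** Hypothetical abstraction over the external store (Cassandra in the question). */
    interface RowSource {
        Iterator<String[]> rows(); // each row is {rowKey, columnValue}
    }

    /** In-memory stand-in so the sketch runs without a cluster. */
    static class InMemorySource implements RowSource {
        private final List<String[]> data;

        InMemorySource(List<String[]> data) {
            this.data = data;
        }

        public Iterator<String[]> rows() {
            return data.iterator();
        }
    }

    /** Map step: emit every lower-cased word found in the row's column value. */
    static List<String> map(String[] row) {
        List<String> words = new ArrayList<>();
        for (String word : row[1].toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Reduce step: sum the occurrences of each emitted word. */
    static Map<String, Integer> run(RowSource source) {
        Map<String, Integer> counts = new TreeMap<>();
        Iterator<String[]> it = source.rows();
        while (it.hasNext()) {
            for (String word : map(it.next())) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        RowSource source = new InMemorySource(Arrays.asList(
                new String[] {"row1", "hello world"},
                new String[] {"row2", "hello EMR"}));
        System.out.println(run(source)); // prints {emr=1, hello=2, world=1}
    }
}
```

The point of the sketch is that the map and reduce logic never mentions S3: the job reads from whatever the source hands it, so the S3 bucket field in the job flow form does not limit where your code can pull data from.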

If you want to use something like Hive, you will need to write a Hive storage handler to use data backed by Cassandra.

prestomation answered Nov 14 '22 14:11