I am in a scenario where I have two MapReduce jobs. I am more comfortable with Python and plan to use it to write the MapReduce scripts, running them with Hadoop Streaming. Is there a convenient way to chain both jobs in the following form when Hadoop Streaming is used?
Map1 -> Reduce1 -> Map2 -> Reduce2
I've seen plenty of ways to accomplish this in Java, but I need something for Hadoop Streaming.
MapReduce is a computation abstraction that works well with the Hadoop Distributed File System (HDFS).
We use the MultipleInputs class, which supports MapReduce jobs that have multiple input paths, with a different InputFormat and Mapper for each path.
You can submit as many jobs as you want; they will be queued up, and the scheduler will run them in FIFO order (by default) as resources become available.
Job chaining is a term in MapReduce that refers to running several jobs as steps of a single workflow. With job chaining, the output of the first job becomes the input of the next job in the chain, and so on until the last job completes. It is a form of pipelining MapReduce jobs that makes them more manageable.
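For Hadoop Streaming specifically, the simplest way to get the Map1 -> Reduce1 -> Map2 -> Reduce2 chain is a small driver script that launches the two streaming jobs back to back, pointing the second job's -input at the first job's -output. Here is a minimal sketch; the streaming jar path, the HDFS paths, and the script names (mapper1.py, reducer1.py, mapper2.py, reducer2.py) are placeholders you would adjust for your cluster:

```python
#!/usr/bin/env python
"""Driver that chains two Hadoop Streaming jobs sequentially."""
import subprocess
import sys

# Adjust to wherever your distribution installs the streaming jar.
STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"

def run_streaming_job(mapper, reducer, input_path, output_path):
    """Launch one Hadoop Streaming job and wait for it to finish."""
    cmd = [
        "hadoop", "jar", STREAMING_JAR,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
        # Ship the Python scripts to the task nodes.
        "-file", mapper,
        "-file", reducer,
    ]
    if subprocess.call(cmd) != 0:
        sys.exit("Job failed: %s" % " ".join(cmd))

# Job 1: Map1 -> Reduce1; its output directory feeds job 2.
run_streaming_job("mapper1.py", "reducer1.py",
                  "/data/input", "/data/intermediate")
# Job 2: Map2 -> Reduce2, reading job 1's output.
run_streaming_job("mapper2.py", "reducer2.py",
                  "/data/intermediate", "/data/output")
```

Because each hadoop command blocks until its job finishes, the second job only starts after the first one succeeds; you may also want to delete the intermediate directory once the chain completes.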
Here is a great blog post on how to use Cascading with Hadoop Streaming: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/
The value here is that you can mix Java (Cascading query flows) with your custom streaming operations in the same app. I find this much less brittle than the alternatives.
Note that the Cascade object in Cascading allows you to chain multiple Flows (per the blog post above, your streaming job would become a MapReduceFlow).
Disclaimer: I'm the author of Cascading