I am in a scenario where I have two MapReduce jobs. I am more comfortable with Python and plan to use it to write the MapReduce scripts, running them with Hadoop Streaming. Is there a convenient way to chain both jobs in the following form when Hadoop Streaming is used?
Map1 -> Reduce1 -> Map2 -> Reduce2
I've seen plenty of ways to accomplish this in Java, but I need something for Hadoop Streaming.
MapReduce is a computation abstraction that works well with the Hadoop Distributed File System (HDFS).
We use the MultipleInputs class, which supports MapReduce jobs that have multiple input paths, with a different InputFormat and Mapper for each path.
You can submit as many jobs as you want; they will be queued up, and the scheduler will run them in FIFO order (by default) as resources become available.
Job chaining is a term in MapReduce that refers to running several jobs as steps of a single workflow. With job chaining, the output of the first job becomes the input of the next job in the chain, and so on until the last job completes. It is a form of pipelining MapReduce jobs that makes them more manageable.
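For Hadoop Streaming specifically, the simplest way to get the Map1 -> Reduce1 -> Map2 -> Reduce2 chain is a small driver script that launches the two streaming jobs back to back, pointing the second job's -input at the first job's -output. Here is a minimal sketch; the streaming jar path, the HDFS paths, and the script names (mapper1.py, reducer1.py, mapper2.py, reducer2.py) are placeholders you would adjust for your cluster:

```python
#!/usr/bin/env python
"""Driver that chains two Hadoop Streaming jobs sequentially."""
import subprocess
import sys

# Adjust to wherever your distribution installs the streaming jar.
STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"

def run_streaming_job(mapper, reducer, input_path, output_path):
    """Launch one Hadoop Streaming job and wait for it to finish."""
    cmd = [
        "hadoop", "jar", STREAMING_JAR,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
        # Ship the Python scripts to the task nodes.
        "-file", mapper,
        "-file", reducer,
    ]
    if subprocess.call(cmd) != 0:
        sys.exit("Job failed: %s" % " ".join(cmd))

# Job 1: Map1 -> Reduce1; its output directory feeds job 2.
run_streaming_job("mapper1.py", "reducer1.py",
                  "/data/input", "/data/intermediate")
# Job 2: Map2 -> Reduce2, reading job 1's output.
run_streaming_job("mapper2.py", "reducer2.py",
                  "/data/intermediate", "/data/output")
```

Because each hadoop command blocks until its job finishes, the second job only starts after the first one succeeds; you may also want to delete the intermediate directory once the chain completes.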
Here is a great blog post on how to use Cascading with Hadoop Streaming: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/
The value here is that you can mix Java (Cascading query flows) with your custom streaming operations in the same app. I find this much less brittle than the alternatives.
Note that the Cascade object in Cascading allows you to chain multiple Flows (per the blog post above, your streaming job would become a MapReduceFlow).
Disclaimer: I'm the author of Cascading