I want to understand what to do in that case.
For example, I have 1 TB of text data, and let's assume that 300 GB of it is the word "Hello".
After each map operation, I will have a collection of <"Hello", 1> key-value pairs.
But as I said, this is a huge collection, 300 GB, and as I understand it, the reducer gets all of it and will crash.
What is the solution for this?
Let's assume that the combiner won't help me here (the WordCount example is just for simplicity) and the data will still be too big for the reducer.
Hadoop handles large files by splitting them into blocks of 64 MB or 128 MB (the default). These blocks are distributed across DataNodes, and the metadata is kept on the NameNode. When a MapReduce job runs, each block gets a mapper for execution; you cannot set the number of mappers directly.
How many mappers run in parallel depends on how many cores and how much memory you have on each slave node. Generally, one mapper should get 1 to 1.5 cores, so if you have 15 cores per node you can run about 10 mappers per node, and with 100 data nodes in the cluster about 1000 mappers in total.
The number of mappers for a MapReduce job is driven by the number of input splits, and the input splits depend on the block size. For example, with 500 MB of data and a 128 MB HDFS block size, you end up with approximately 4 mappers (see the sketch below).
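As a rough, hedged sketch of that arithmetic (the file and split sizes below are purely illustrative, and the FileInputFormat setters shown influence the split size rather than set an exact mapper count):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitMath {
    public static void main(String[] args) throws Exception {
        long fileSize  = 500L * 1024 * 1024;   // 500 MB of input (illustrative)
        long splitSize = 128L * 1024 * 1024;   // 128 MB HDFS block / split size

        // Each input split gets one mapper, so roughly:
        long mappers = (long) Math.ceil((double) fileSize / splitSize);
        System.out.println("approx. mappers = " + mappers);   // prints 4

        // You can nudge the split size (and therefore the mapper count)
        // via FileInputFormat, but you cannot set the number of mappers directly.
        Job job = Job.getInstance();
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}
```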
Having said that, there are certain cases where MapReduce is not a suitable choice: real-time processing; problems that are not easy to express as an MR program; and workloads where your intermediate processes need to talk to each other (MR jobs run in isolation).
The intermediate (Mapper) output is stored in the local file system of the nodes running the mapper task and is cleaned up afterwards. Note that this mapper output is NOT stored in HDFS. The reducer does indeed get all the intermediate key-value pairs for any particular key (i.e., all 300 GB of output for the key 'Hello' will be processed by the same Reducer task). This data is brought into memory only when required.
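For instance, here is a minimal sketch of a WordCount-style reducer (the class and field names are just illustrative) that streams through the values and only ever keeps a running sum in memory, so even 300 GB of values for 'Hello' never need to fit in RAM at once:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: the framework streams values from its disk-backed
// iterator, so only the current value and the running sum are ever held
// in memory -- not the whole 300 GB.
public class SumReducer extends Reducer<Text, IntWritable, Text, LongWritable> {

    private final LongWritable result = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;                      // long, since the count can exceed Integer.MAX_VALUE
        for (IntWritable val : values) {   // one value at a time
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```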
Hope this helps.
The reducer does get all of that data, but that data is actually written to disk and is only brought into memory as you iterate through the Iterable of values. In fact, the object returned by that iteration is reused for each value: its fields and other state are simply replaced before the object is handed to you. That means you have to explicitly copy the value object if you want to hold all value objects in memory at the same time.
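As a hedged sketch of what that copying looks like in practice (the class name and the collected list are purely illustrative, and the values are assumed to be Text):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CollectingReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        // WRONG: the framework reuses the same Text instance on every
        // iteration, so every element of such a list would end up holding
        // the last value seen.
        // List<Text> broken = new ArrayList<>();
        // for (Text val : values) { broken.add(val); }

        // CORRECT: copy the current contents into a fresh object before
        // keeping a reference to it.
        List<Text> copies = new ArrayList<>();
        for (Text val : values) {
            copies.add(new Text(val));   // Text(Text other) copies the underlying bytes
        }
        context.write(key, new Text("kept " + copies.size() + " copies"));
    }
}
```

Note that materializing values like this only makes sense when there are few values per key; in the 300 GB scenario above you would stream through them instead, as in the earlier reducer sketch.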