
What if the reducer's input is too big in Hadoop MapReduce

I want to understand what to do in this case.
For example, I have 1 TB of text data, and let's assume that 300 GB of it is the word "Hello".
After each map operation, I will have a collection of key-value pairs of <"Hello", 1>.

But as I said, this is a huge collection, 300 GB, and as I understand it, the reducer gets all of it and will crash.

What is the solution for this?
Let's assume that a combiner won't help me here (the WordCount example is just for simplicity) and the data will still be too big for the reducer.
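
For concreteness, here's a sketch of the kind of mapper I mean (the standard WordCount pattern), which emits a separate <word, 1> pair for every occurrence:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Standard WordCount-style mapper: every word occurrence becomes one <word, 1> pair,
// so 300 GB of "Hello" turns into billions of <"Hello", 1> pairs for one key.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}
```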

asked Aug 16 '15 by member555


People also ask

How does Hadoop process large input?

Hadoop handles large files by splitting them into blocks of 64 MB or 128 MB (the default, depending on version). These blocks are distributed across DataNodes, and the metadata is kept on the NameNode. When a MapReduce program runs, each block typically gets its own mapper. You cannot set the number of mappers directly.
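
As an illustration, a small sketch (the path is a placeholder) that lists the blocks of a large HDFS file and the DataNodes holding them; each block normally becomes one input split and therefore one map task:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Print the block layout of a large HDFS file: offset, length and the hosts
// (DataNodes) that store each block replica.
public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path bigFile = new Path("/data/input/huge.txt");   // placeholder path
        FileStatus status = fs.getFileStatus(bigFile);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```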

What determines the number of reducers of a MapReduce job?

It depends on how many cores and how much memory you have on each slave node. Generally, one mapper should get 1 to 1.5 processor cores, so with 15 cores you can run about 10 mappers per node. With 100 data nodes in the Hadoop cluster, that means roughly 1,000 mappers can run across the cluster.
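
Unlike the mapper count, the reducer count is not derived from the input; it is set explicitly on the job, typically sized to the cluster's cores and memory as described above. A minimal sketch (job name is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// The number of reduce tasks is an explicit job setting (mapreduce.job.reduces).
public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setNumReduceTasks(10);   // e.g. roughly one reducer per available core/container
    }
}
```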

How do you decide about number of mappers required for solving any problem with MapReduce?

The number of mappers for a MapReduce job is driven by the number of input splits, and input splits depend on the block size. For example, if we have 500 MB of data and the HDFS block size is 128 MB, there will be approximately 4 mappers. If the block size were 64 MB instead (with the minimum split size left at its default), the same data would give roughly 8 splits.
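
A sketch of how the split size can be nudged per job (job name and input path are placeholders); the static helpers are from the new-API FileInputFormat:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// By default splits follow the HDFS block size; capping the split size
// produces more, smaller splits and hence more map tasks.
public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        FileInputFormat.addInputPath(job, new Path("/data/input"));   // placeholder path

        // 500 MB input with 128 MB blocks -> ceil(500 / 128) = 4 splits (~4 mappers).
        // Forcing 64 MB splits raises that to about 8:
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}
```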

What kind of problems are not suitable for MapReduce?

That said, there are certain cases where MapReduce is not a suitable choice: real-time processing; problems that are awkward to express as a MapReduce program; and workloads whose intermediate stages need to talk to each other (jobs run in isolation).


2 Answers

The intermediate (mapper) output is stored on the local file system of the nodes running the map tasks and is cleaned up afterwards. Note that this mapper output is NOT stored in HDFS. The reducer does indeed get all the intermediate key-value pairs for any particular key (i.e. all 300 GB of output for the key 'Hello' will be processed by the same reduce task). This data is brought into memory only when required.
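
For example, a typical reducer just streams through the values for a key; nothing forces all 300 GB into memory at once. A minimal sketch using the standard new-API Reducer:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The values for a key are read from the merged on-disk segments one at a time,
// so summing 300 GB of <"Hello", 1> pairs never needs them all in memory.
public class StreamingSumReducer extends Reducer<Text, IntWritable, Text, LongWritable> {

    private final LongWritable result = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (IntWritable value : values) {   // each iteration pulls the next value from disk as needed
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```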

Hope this helps.

answered Sep 21 '22 by Ankit Khettry



The reducer does get all of that data, but the data is actually written to disk and is only brought into memory as you iterate through the Iterable of values. In fact, the object returned by that iteration is reused for each value: its fields and other state are simply replaced before the object is handed to you. That means you have to explicitly copy each value object if you want to hold all of them in memory at the same time.
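
For example (the value type Text is chosen only for illustration), buffering values requires an explicit copy:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The framework hands back the SAME Text instance on every iteration, so
// buffering plain references would leave a list of identical objects all
// holding the last value seen. Copying each value avoids that.
public class BufferingReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<Text> buffered = new ArrayList<>();
        for (Text value : values) {
            // Wrong: buffered.add(value) -- every element would point at the reused instance.
            buffered.add(new Text(value));   // explicit copy, so each element survives iteration
        }
        // ... work with the buffered copies (only safe if they actually fit in memory) ...
        for (Text copy : buffered) {
            context.write(key, copy);
        }
    }
}
```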

answered Sep 19 '22 by Chris Gerken