I want to understand what to do in that case.
For example, I have 1 TB of text data, and let's assume that 300 GB of it is the word "Hello".
After each map operation, I will have a collection of <"Hello", 1> key-value pairs.
But as I said, this is a huge collection, 300 GB, and as I understand it, the reducer gets all of it and will crash.
What is the solution for this?
Let's assume that the combiner won't help me here (the WordCount example is just for simplicity) and the data will still be too big for the reducer.
Hadoop handles large files by splitting them into blocks of 64 MB or 128 MB (the default). These blocks are distributed across DataNodes, and the metadata is kept on the NameNode. When a MapReduce job runs, each block gets a mapper for execution; you cannot set the number of mappers directly.
How many mappers run in parallel depends on how many cores and how much memory you have on each slave node. Generally, one mapper should get 1 to 1.5 cores, so if you have 15 cores per node you can run about 10 mappers per node, and with 100 data nodes in the cluster about 1000 mappers in total.
The number of mappers for a MapReduce job is driven by the number of input splits, and the input splits depend on the block size. For example, with 500 MB of data and a 128 MB HDFS block size, you end up with approximately 4 mappers (see the sketch below).
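As a rough, hedged sketch of that arithmetic (the file and split sizes below are purely illustrative, and the FileInputFormat setters shown influence the split size rather than set an exact mapper count):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitMath {
    public static void main(String[] args) throws Exception {
        long fileSize  = 500L * 1024 * 1024;   // 500 MB of input (illustrative)
        long splitSize = 128L * 1024 * 1024;   // 128 MB HDFS block / split size

        // Each input split gets one mapper, so roughly:
        long mappers = (long) Math.ceil((double) fileSize / splitSize);
        System.out.println("approx. mappers = " + mappers);   // prints 4

        // You can nudge the split size (and therefore the mapper count)
        // via FileInputFormat, but you cannot set the number of mappers directly.
        Job job = Job.getInstance();
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}
```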
Having said that, there are certain cases where MapReduce is not a suitable choice: real-time processing; problems that are not easy to express as an MR program; and workloads where your intermediate processes need to talk to each other (MR jobs run in isolation).
The intermediate (Mapper) output is stored in the local file system of the nodes running the mapper task and is cleaned up afterwards. Note that this mapper output is NOT stored in HDFS. The reducer does indeed get all the intermediate key-value pairs for any particular key (i.e., all 300 GB of output for the key 'Hello' will be processed by the same Reducer task). This data is brought into memory only when required.
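For instance, here is a minimal sketch of a WordCount-style reducer (the class and field names are just illustrative) that streams through the values and only ever keeps a running sum in memory, so even 300 GB of values for 'Hello' never need to fit in RAM at once:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: the framework streams values from its disk-backed
// iterator, so only the current value and the running sum are ever held
// in memory -- not the whole 300 GB.
public class SumReducer extends Reducer<Text, IntWritable, Text, LongWritable> {

    private final LongWritable result = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;                      // long, since the count can exceed Integer.MAX_VALUE
        for (IntWritable val : values) {   // one value at a time
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```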
Hope this helps.
The reducer does get all of that data, but that data is actually written to disk and is only brought into memory as you iterate through the Iterable of values. In fact, the object returned by that iteration is reused for each value: its fields and other state are simply replaced before the object is handed to you. That means you have to explicitly copy the value object if you want to hold all value objects in memory at the same time.
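As a hedged sketch of what that copying looks like in practice (the class name and the collected list are purely illustrative, and the values are assumed to be Text):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CollectingReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        // WRONG: the framework reuses the same Text instance on every
        // iteration, so every element of such a list would end up holding
        // the last value seen.
        // List<Text> broken = new ArrayList<>();
        // for (Text val : values) { broken.add(val); }

        // CORRECT: copy the current contents into a fresh object before
        // keeping a reference to it.
        List<Text> copies = new ArrayList<>();
        for (Text val : values) {
            copies.add(new Text(val));   // Text(Text other) copies the underlying bytes
        }
        context.write(key, new Text("kept " + copies.size() + " copies"));
    }
}
```

Note that materializing values like this only makes sense when there are few values per key; in the 300 GB scenario above you would stream through them instead, as in the earlier reducer sketch.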