I have a dataset consisting of many small files (30-40 MB each on average). I want to run analytics on them with MapReduce, but with each job the mapper reads the files all over again, which puts a heavy load on I/O performance (overheads etc.).
I would like to know whether it is possible to run the mapper once and emit several different outputs to different reducers. From what I have found, multiple reducers per job are not possible; the only option is job chaining. However, I want to run these jobs in parallel, not sequentially, since they all use the same dataset as input and run different analytics. In summary, what I want is something like this:
```
         / Reducer = Analytics1
Mapper --- Reducer = Analytics2
         \ Reducer = Analytics3 ...
```
Is this possible? Or do you have any suggestions for a workaround? Please give me some ideas. Reading these small files over and over again creates a huge overhead and slows down my analysis.
Thanks in advance!
Edit: I forgot to mention that I'm using Hadoop v2.1.0-beta with YARN.
You can:

- Run a single job whose mapper tags each record with the analytics job it is destined for, and have the reducer dispatch on that tag (Alternative 1, detailed below).
- Use Apache Tez, whose DAG model generalizes MapReduce and allows a single map stage to feed several reduce stages (Alternative 2).

Some useful references on Apache Tez: the project page at https://tez.apache.org/.
EDIT: Added the following regarding Alternative 1:
You could also make the mapper generate a key indicating which analytics process the output is intended for. Hadoop will automatically group records by this key and send them all to the same reducer. The value generated by the mapper would be a tuple `<k, v>`, where the key (`k`) is the original key you intended to generate. Thus, the mapper generates `<k_analytics, <k, v>>` records. The reducer has a reduce method that reads the key and, depending on the key, calls the appropriate analytics method (within your reducer class). This approach works, but only if your reducers do not have to deal with huge amounts of data, since you will likely need to keep the `<k, v>` tuples in memory (in a list or a hashtable) while you run the analytics (they won't be sorted by their key `k`). If this is not something your reducer can handle, then the custom partitioner suggested by @praveen-sripati may be an option to explore.
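A minimal sketch of this idea, assuming (hypothetically) three analytics jobs tagged `A1`/`A2`/`A3` and input lines already laid out as `k<TAB>v`; the `runAnalyticsN` stubs are placeholders for your actual analytics:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One mapper pass over the data; each record is emitted once per analytics tag.
class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final String[] TAGS = {"A1", "A2", "A3"}; // one tag per analytics job
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String tag : TAGS) {
            outKey.set(tag);
            context.write(outKey, line); // the packed <k, v> tuple travels as the value
        }
    }
}

// All <k, v> tuples for one analytics tag arrive in a single reduce() call,
// unsorted by k, so the analytics methods may need to buffer them in memory.
class DispatchingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text tag, Iterable<Text> tuples, Context context)
            throws IOException, InterruptedException {
        switch (tag.toString()) {
            case "A1": runAnalytics1(tuples, context); break;
            case "A2": runAnalytics2(tuples, context); break;
            case "A3": runAnalytics3(tuples, context); break;
        }
    }

    // Hypothetical stubs; replace with the real analytics.
    private void runAnalytics1(Iterable<Text> tuples, Context context) { }
    private void runAnalytics2(Iterable<Text> tuples, Context context) { }
    private void runAnalytics3(Iterable<Text> tuples, Context context) { }
}
```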
EDIT: As suggested by @judge-mental, alternative 1 can be further improved by having the mappers emit `<<k_analytics, k>, value>`; in other words, make the original key part of the composite key rather than part of the value, so that a reducer receives all the keys for one analytics job grouped together and can perform streaming operations on the values without having to keep them in RAM.
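A sketch of that refinement, again assuming the hypothetical `A1`/`A2`/`A3` tags and `k<TAB>v` input layout, with the composite key packed into a `Text` as `tag|k`:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The analytics tag becomes part of the key ("A1|k"), so Hadoop sorts and
// groups by (tag, k): each reduce() call sees all values for one original
// key of one analytics job and can process them as a stream.
class CompositeKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final String[] TAGS = {"A1", "A2", "A3"};
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] kv = line.toString().split("\t", 2); // hypothetical "k<TAB>v" layout
        for (String tag : TAGS) {
            outKey.set(tag + "|" + kv[0]); // <<k_analytics, k>, value>
            context.write(outKey, new Text(kv[1]));
        }
    }
}

class StreamingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text compositeKey, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String[] parts = compositeKey.toString().split("\\|", 2);
        String tag = parts[0], k = parts[1];
        // Values for this (tag, k) stream past once; no in-memory list needed.
        long count = 0;
        for (Text v : values) {
            count++; // hypothetical per-key analytics, e.g. a simple count
        }
        context.write(new Text(tag + "\t" + k), new Text(Long.toString(count)));
    }
}
```

With a plain `Text` composite key, the default sorting and grouping both use the full `tag|k` string, which is enough for per-key streaming; if you also want all keys of one analytics job routed to the same reducer task, combine this with a custom partitioner as below.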
It might be possible by using a custom partitioner. The custom partitioner would redirect the output of the mapper to the appropriate reducer based on the key, so the mapper output keys would be prefixed per target reducer, e.g. `R1*`, `R2*`, `R3*`. You would need to look into the pros and cons of this approach.
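A minimal sketch of such a partitioner, assuming the hypothetical `R1_`/`R2_`/`R3_` key prefixes; the job would be configured with `job.setNumReduceTasks(3)` and `job.setPartitionerClass(AnalyticsPartitioner.class)`:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route each key to a fixed reducer based on its analytics prefix,
// so reducer 0 runs Analytics1, reducer 1 runs Analytics2, and so on.
class AnalyticsPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String k = key.toString();
        if (k.startsWith("R1_")) return 0;
        if (k.startsWith("R2_")) return 1;
        return 2; // "R3_" and anything else
    }
}
```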
As mentioned, Tez is one of the alternatives, but at the time of writing it is still in the Apache Incubator phase.