How can I work with a large number of small files in Hadoop?

Tags:

hadoop

I am new to Hadoop and I'm working with a large number of small files in the wordcount example. It takes a lot of map tasks, which slows down my execution.

How can I reduce the number of map tasks?

If the best solution to my problem is concatenating the small files into a larger file, how can I do that?

asked Jan 26 '13 at 21:01 by csperson

2 Answers

If you're using something like TextInputFormat, the problem is that each file gets at least one split, so the upper bound on the number of map tasks is the number of files. In your case, with many very small files, you end up with many mappers that each process very little data.

To remedy that, you should use CombineFileInputFormat, which packs multiple files into the same split (I think up to the block size limit). With that format, the number of mappers no longer depends on the number of files; it simply depends on the amount of data.

You will have to create your own input format by extending CombineFileInputFormat; you can find an implementation here. Once you have your InputFormat defined, let's call it CombinedInputFormat as in the link, you can tell your job to use it with:

job.setInputFormatClass(CombinedInputFormat.class);
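
For reference, here is a minimal sketch of what such a CombinedInputFormat could look like (this is not the code behind the link; the nested CombinedLineRecordReader class and all names in it are my own illustration). It groups many small files into each split and hands each file to the standard LineRecordReader:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        // CombineFileRecordReader creates one CombinedLineRecordReader per file in the combined split
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, CombinedLineRecordReader.class);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Small files are only grouped, never split further
        return false;
    }

    // Reads the index-th file of a combined split with the ordinary line reader
    public static class CombinedLineRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private final CombineFileSplit split;
        private final Integer index;

        public CombinedLineRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
            this.split = split;
            this.index = index;
        }

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Expose the single file as a regular FileSplit and delegate to LineRecordReader
            FileSplit fileSplit = new FileSplit(
                    split.getPath(index), split.getOffset(index), split.getLength(index), split.getLocations());
            delegate.initialize(fileSplit, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return delegate.nextKeyValue();
        }

        @Override
        public LongWritable getCurrentKey() {
            return delegate.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() {
            return delegate.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}

You may also want to cap how much data goes into one combined split, for example by calling the protected setMaxSplitSize() from your subclass's constructor or by setting the max split size configuration property, otherwise very few (possibly very large) splits may be produced. Newer Hadoop releases also ship a ready-made CombineTextInputFormat that covers this common text-file case.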
answered Nov 13 '22 by Charles Menguy


Cloudera posted a blog entry on the small files problem some time back. It's an old post, but the suggested methods still apply.
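
One of the approaches that post discusses is packing the small files into a single SequenceFile, with each original file name as the key and its contents as the value, and then running the MapReduce job over that SequenceFile. A rough sketch of such a packer (the class name, argument handling, and the assumption of a flat input directory are my own):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);   // directory containing the small files
        Path outputFile = new Path(args[1]); // resulting SequenceFile

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, outputFile, Text.class, BytesWritable.class);
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDir()) {
                    continue; // skip sub-directories
                }
                // Read the whole small file into memory
                byte[] contents = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    in.readFully(contents);
                } finally {
                    in.close();
                }
                // key = original file name, value = raw file contents
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(contents));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

The job can then read the packed file with SequenceFileInputFormat, so a handful of mappers process all the original files. HAR files (created with the hadoop archive command) are another option discussed in that post.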

answered Nov 13 '22 by Praveen Sripati