How can I work with a large number of small files in Hadoop?

Tags:

hadoop

I am new to Hadoop and I'm working with a large number of small files in the wordcount example. It takes a lot of map tasks, which slows down my execution.

How can I reduce the number of map tasks?

If the best solution to my problem is concatenating the small files into a larger file, how can I do that?

asked Jan 26 '13 at 21:01 by csperson

2 Answers

If you're using something like TextInputFormat, the problem is that each file gets at least one split, so the upper bound on the number of map tasks is the number of files. In your case, with many very small files, you end up with many mappers that each process very little data.

To remedy that, you should use CombineFileInputFormat, which packs multiple files into the same split (I think up to the block size limit). With that format, the number of mappers no longer depends on the number of files; it simply depends on the amount of data.

You will have to create your own input format by extending CombineFileInputFormat; you can find an implementation here. Once you have your InputFormat defined, let's call it CombinedInputFormat as in the link, you can tell your job to use it with:

job.setInputFormatClass(CombinedInputFormat.class);
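
For reference, here is a minimal sketch of what such a CombinedInputFormat could look like (this is not the code behind the link; the nested CombinedLineRecordReader class and all names in it are my own illustration). It groups many small files into each split and hands each file to the standard LineRecordReader:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        // CombineFileRecordReader creates one CombinedLineRecordReader per file in the combined split
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, CombinedLineRecordReader.class);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Small files are only grouped, never split further
        return false;
    }

    // Reads the index-th file of a combined split with the ordinary line reader
    public static class CombinedLineRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private final CombineFileSplit split;
        private final Integer index;

        public CombinedLineRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
            this.split = split;
            this.index = index;
        }

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Expose the single file as a regular FileSplit and delegate to LineRecordReader
            FileSplit fileSplit = new FileSplit(
                    split.getPath(index), split.getOffset(index), split.getLength(index), split.getLocations());
            delegate.initialize(fileSplit, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return delegate.nextKeyValue();
        }

        @Override
        public LongWritable getCurrentKey() {
            return delegate.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() {
            return delegate.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}

You may also want to cap how much data goes into one combined split, for example by calling the protected setMaxSplitSize() from your subclass's constructor or by setting the max split size configuration property, otherwise very few (possibly very large) splits may be produced. Newer Hadoop releases also ship a ready-made CombineTextInputFormat that covers this common text-file case.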
answered Nov 13 '22 by Charles Menguy


Cloudera posted a blog entry on the small files problem some time back. It's an old post, but the suggested methods still apply.
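
One of the approaches that post discusses is packing the small files into a single SequenceFile, with each original file name as the key and its contents as the value, and then running the MapReduce job over that SequenceFile. A rough sketch of such a packer (the class name, argument handling, and the assumption of a flat input directory are my own):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);   // directory containing the small files
        Path outputFile = new Path(args[1]); // resulting SequenceFile

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, outputFile, Text.class, BytesWritable.class);
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDir()) {
                    continue; // skip sub-directories
                }
                // Read the whole small file into memory
                byte[] contents = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    in.readFully(contents);
                } finally {
                    in.close();
                }
                // key = original file name, value = raw file contents
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(contents));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

The job can then read the packed file with SequenceFileInputFormat, so a handful of mappers process all the original files. HAR files (created with the hadoop archive command) are another option discussed in that post.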

answered Nov 13 '22 by Praveen Sripati