What is the easiest way to combine small HDFS files?

Tags: hadoop, hdfs, flume

I'm collecting logs with Flume and writing them to HDFS. In my test case I end up with small files (~300 kB), because the log-collection process is scaled for the real usage.

Is there an easy way to combine these small files into larger ones that are closer to the HDFS block size (64 MB)?

asked Dec 13 '10 by KARASZI István


3 Answers

GNU coreutils split can do the job.

If the source data is line-oriented (in my case it is) and one line is around 84 bytes, then a 64 MB HDFS block can hold roughly 800,000 lines (67,108,864 / 84 ≈ 799,000):

hadoop dfs -cat /sourcedir/* | split --lines=800000 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/

or with the --line-bytes option:

hadoop dfs -cat /sourcedir/* | split --line-bytes=67108864 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/
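
split writes its pieces to the local working directory as joined_aa, joined_ab, and so on. A quick check and cleanup afterwards could look like this (a small sketch, reusing the /destdir path from above):

hadoop dfs -ls /destdir   # the combined files should now be close to 64 MB each
rm ./joined_*             # remove the local intermediate pieces
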
answered by KARASZI István

My current solution is to write a MapReduce job that effectively does nothing, with a limited number of reducers. Each reducer writes one output file, so together they concatenate the records into that many larger files. You can add the name of the original file to each line to help show where each record came from.
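
A minimal sketch of this idea using Hadoop Streaming instead of a custom Java job (the streaming jar path, reducer count, and directories are illustrative; note that the shuffle re-sorts lines, so the original ordering within a file is not preserved):

# identity mapper and reducer: the job simply rewrites the input into
# as many files as there are reducers (part-00000 .. part-00007 here)
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=8 \
    -input /sourcedir \
    -output /destdir \
    -mapper cat \
    -reducer cat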

I'm still interested in hearing if there is a standard or proven best way of doing this that I am not aware of.

answered by Donald Miner


You should take a look at File Crusher, open-sourced by media6degrees. It might be a little outdated, but you can download the source and make your own changes and/or contribute. The JAR and source are at: http://www.jointhegrid.com/hadoop_filecrush/index.jsp

This is essentially a map-reduce technique for merging small files.

answered by Luis R.