What is the easiest way to combine small HDFS files?

Tags: hadoop, hdfs, flume

I'm collecting logs with Flume and writing them to HDFS. In my test case I end up with small files (~300 kB), because the log-collection process is scaled for the real usage.

Is there an easy way to combine these small files into larger ones that are closer to the HDFS block size (64 MB)?

asked Dec 13 '10 by KARASZI István


3 Answers

GNU coreutils split can do the job.

If the source data is line-oriented (in my case it is) and one line is around 84 bytes, then a 64 MB HDFS block can hold roughly 800,000 lines (67,108,864 / 84 ≈ 799,000):

hadoop dfs -cat /sourcedir/* | split --lines=800000 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/

or with the --line-bytes option:

hadoop dfs -cat /sourcedir/* | split --line-bytes=67108864 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/
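
split writes its pieces to the local working directory as joined_aa, joined_ab, and so on. A quick check and cleanup afterwards could look like this (a small sketch, reusing the /destdir path from above):

hadoop dfs -ls /destdir   # the combined files should now be close to 64 MB each
rm ./joined_*             # remove the local intermediate pieces
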
answered by KARASZI István

My current solution is to write a MapReduce job that effectively does nothing, with a limited number of reducers. Each reducer writes one output file, so together they concatenate the records into that many larger files. You can add the name of the original file to each line to help show where each record came from.
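
A minimal sketch of this idea using Hadoop Streaming instead of a custom Java job (the streaming jar path, reducer count, and directories are illustrative; note that the shuffle re-sorts lines, so the original ordering within a file is not preserved):

# identity mapper and reducer: the job simply rewrites the input into
# as many files as there are reducers (part-00000 .. part-00007 here)
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=8 \
    -input /sourcedir \
    -output /destdir \
    -mapper cat \
    -reducer cat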

I'm still interested in hearing if there is a standard or proven best way of doing this that I am not aware of.

answered by Donald Miner


You should take a look at File Crusher, open-sourced by media6degrees. It might be a little outdated, but you can download the source and make your own changes and/or contribute. The JAR and source are at: http://www.jointhegrid.com/hadoop_filecrush/index.jsp

This is essentially a map-reduce technique for merging small files.

answered by Luis R.