I'm collecting logs with Flume and storing them in HDFS. For my test case I end up with small files (~300 kB), because the log-collecting process was scaled for real usage.
Is there any easy way to combine these small files into larger ones which are closer to the HDFS block size (64MB)?
GNU coreutils split can do the job.
If the source data are lines, as they are in my case, and one line is around 84 bytes, then a 64 MB HDFS block (67,108,864 bytes) can hold roughly 800,000 lines:
hadoop dfs -cat /sourcedir/* | split --lines=800000 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/
Or with the --line-bytes option:
hadoop dfs -cat /sourcedir/* | split --line-bytes=67108864 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/
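If you would rather not pipe everything through the local filesystem, the same joining can be done directly against HDFS with the FileSystem API. Below is an untested sketch (the class name and paths are placeholders, and it assumes a reasonably recent Hadoop); it copies whole source files into joined_N output files and rolls over to a new one once roughly 64 MB has been written, so files are only split on source-file boundaries:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class JoinSmallFiles {
    // Target size for each joined file: one 64 MB HDFS block.
    private static final long TARGET_SIZE = 64L * 1024 * 1024;

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path srcDir = new Path(args[0]); // e.g. /sourcedir
        Path dstDir = new Path(args[1]); // e.g. /destdir

        long written = 0;
        int part = 0;
        FSDataOutputStream out = fs.create(new Path(dstDir, "joined_" + part));

        for (FileStatus status : fs.listStatus(srcDir)) {
            if (!status.isFile()) {
                continue; // skip subdirectories
            }
            // Start a new output file once the current one has reached the target size.
            if (written >= TARGET_SIZE) {
                out.close();
                part++;
                written = 0;
                out = fs.create(new Path(dstDir, "joined_" + part));
            }
            FSDataInputStream in = fs.open(status.getPath());
            IOUtils.copyBytes(in, out, 4096, false); // false: leave 'out' open
            in.close();
            written += status.getLen();
        }
        out.close();
    }
}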
My current solution is to write a MapReduce job that effectively does nothing, while having a limited number of reducers. Each reducer outputs a file, so together they cat the small files into a few larger ones. You can add the name of the original file to each line to show where it came from.
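In case it is useful, here is roughly what such a do-nothing job looks like with the org.apache.hadoop.mapreduce API. This is an untested sketch; the class names and the reducer count are placeholders, and note that the shuffle will sort the lines as a side effect:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CatSmallFiles {

    // Pass each line through unchanged, prefixed with the name of the file it came from.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String source = ((FileSplit) context.getInputSplit()).getPath().getName();
            context.write(new Text(source + "\t" + line), NullWritable.get());
        }
    }

    // Write each line back out; with N reducers the output is N files.
    public static class PassThroughReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text line, Iterable<NullWritable> occurrences, Context context)
                throws IOException, InterruptedException {
            for (NullWritable ignored : occurrences) {
                context.write(line, NullWritable.get()); // preserves duplicate lines
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cat small files");
        job.setJarByClass(CatSmallFiles.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setReducerClass(PassThroughReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(4); // limits the number of output files
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run it with the source and destination directories as arguments; with four reducers you get four output files in the destination directory.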
I'm still interested in hearing if there is a standard or proven best way of doing this that I am not aware of.
You should take a look at File Crusher, open-sourced by media6degrees. It might be a little outdated, but you can download the source and make your own changes and/or contribute. The JAR and source are at http://www.jointhegrid.com/hadoop_filecrush/index.jsp.
It is essentially a map-reduce technique for merging small files.