Let's say I have this structure on HDFS:
/dir1
    /dir2
        /Name1_2015
            file1.lzo
            file2.lzo
            file3.lzo
        /Name2_2015
            file1.lzo
            file2.lzo
    Name1_2015.lzo
I would like to merge the files of each directory under /dir1/dir2 and append the result to the file /dir1/DirName.lzo.
For example, for /dir1/dir2/Name1_2015, I want to merge file1.lzo, file2.lzo and file3.lzo, and append the result to /dir1/Name1_2015.lzo.
Each file is LZO-compressed.
How can I do it?
Thanks
If you don't care much about parallelism, here's a bash one-liner:
for d in $(hdfs dfs -ls /dir1/dir2 | grep -oP '(?<=/)[^/]+$'); do hdfs dfs -cat /dir1/dir2/$d/*.lzo | lzop -d | lzop | hdfs dfs -put - /dir1/$d.lzo; done
It lists the subdirectories of /dir1/dir2, concatenates the decompressed contents of each one's .lzo files, recompresses the combined stream, and writes it back as /dir1/<name>.lzo.
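One caveat: hdfs dfs -put fails if the target already exists, while the question asks to append to an existing /dir1/Name1_2015.lzo. A sketch of a variant using hdfs dfs -appendToFile (reading from stdin via -), assuming append support is enabled on the cluster and that whatever later reads the result accepts concatenated LZO streams:

    for d in $(hdfs dfs -ls /dir1/dir2 | grep -oP '(?<=/)[^/]+$'); do
      # decompress, recompress into one stream, then append to the existing file
      hdfs dfs -cat /dir1/dir2/$d/*.lzo | lzop -d | lzop | hdfs dfs -appendToFile - /dir1/$d.lzo
    done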
You can decompress all the files in parallel with MapReduce, but how do you build a single archive from multiple files in parallel? As far as I know, HDFS does not allow multiple processes to write to one file concurrently, so any approach ends up funneling through a single writer anyway.
I would do this with Hive, as follows (a complete sketch of the whole sequence follows these steps):
Rename the subdirectories to Hive's partition naming convention: name=1_2015 and name=2_2015
CREATE EXTERNAL TABLE sending_table (all_content STRING)
PARTITIONED BY (name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY {a column delimiter that you know doesn't show up in any of the lines}
LOCATION '/dir1/dir2';
Make a second table that looks like the first, named "receiving", but with no partitions, and in a different directory.
Run this:
SET mapreduce.job.reduces=1; -- this guarantees it'll make one file
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
INSERT INTO TABLE receiving SELECT all_content FROM sending_table;
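Putting it together, a minimal sketch of the full sequence (the '|' delimiter and the /dir1/receiving target directory are placeholders for this example; MSCK REPAIR TABLE is added because Hive won't see the renamed directories as partitions until they're registered):

    # 1. Rename subdirectories to Hive's partition naming convention
    hdfs dfs -mv /dir1/dir2/Name1_2015 /dir1/dir2/name=1_2015
    hdfs dfs -mv /dir1/dir2/Name2_2015 /dir1/dir2/name=2_2015

    -- 2. Source table over the renamed directories
    CREATE EXTERNAL TABLE sending_table (all_content STRING)
    PARTITIONED BY (name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION '/dir1/dir2';
    MSCK REPAIR TABLE sending_table;

    -- 3. Unpartitioned target table in a different directory
    CREATE EXTERNAL TABLE receiving (all_content STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION '/dir1/receiving';

    -- 4. One LZO-compressed output file
    SET mapreduce.job.reduces=1;
    SET hive.exec.compress.output=true;
    SET mapreduce.output.fileoutputformat.compress=true;
    SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
    INSERT INTO TABLE receiving SELECT all_content FROM sending_table;

Two caveats: a plain SELECT can run map-only, in which case the reducer setting has no effect, so you may need to force a reduce phase or enable Hive's small-file merge settings (hive.merge.mapfiles / hive.merge.mapredfiles) to really get one file; and the output file gets a generated name like 000000_0.lzo, so a final hdfs dfs -mv would still be needed to match the /dir1/Name1_2015.lzo naming.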
You can try archiving all the individual LZO files into a HAR (Hadoop Archive) instead. I think merging everything into a single LZO file is needless overhead.
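For reference, a minimal sketch (the archive name dir2.har is arbitrary; hadoop archive runs a MapReduce job and leaves the stored files untouched, so they stay LZO-compressed):

    # Pack the subdirectories of /dir1/dir2 into /dir1/dir2.har
    hadoop archive -archiveName dir2.har -p /dir1/dir2 Name1_2015 Name2_2015 /dir1

    # The files remain readable through the har:// scheme
    hdfs dfs -ls har:///dir1/dir2.har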