Let's say I have this structure on HDFS:
/dir1
    /dir2
        /Name1_2015
            file1.lzo
            file2.lzo
            file3.lzo
        /Name2_2015
            file1.lzo
            file2.lzo
    Name1_2015.lzo
I would like to merge the files of each directory under /dir1/dir2 and append the result to the file /dir1/DirName.lzo.
For example, for /dir1/dir2/Name1_2015, I want to merge file1.lzo, file2.lzo and file3.lzo, and append the result to /dir1/Name1_2015.lzo.
Each file is LZO-compressed.
How can I do it?
Thanks
If you don't care much about parallelism, here's a bash one-liner:
for d in $(hdfs dfs -ls /dir1/dir2 | grep -oP '(?<=/)[^/]+$'); do hdfs dfs -cat /dir1/dir2/$d/*.lzo | lzop -d | lzop | hdfs dfs -put - /dir1/$d.lzo; done
It lists the subdirectories of /dir1/dir2, concatenates the decompressed contents of each one's .lzo files, recompresses the combined stream, and writes it back as /dir1/<name>.lzo.
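One caveat: hdfs dfs -put fails if the target already exists, while the question asks to append to an existing /dir1/Name1_2015.lzo. A sketch of a variant using hdfs dfs -appendToFile (reading from stdin via -), assuming append support is enabled on the cluster and that whatever later reads the result accepts concatenated LZO streams:

    for d in $(hdfs dfs -ls /dir1/dir2 | grep -oP '(?<=/)[^/]+$'); do
      # decompress, recompress into one stream, then append to the existing file
      hdfs dfs -cat /dir1/dir2/$d/*.lzo | lzop -d | lzop | hdfs dfs -appendToFile - /dir1/$d.lzo
    done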
You can decompress all the files in parallel with MapReduce, but how do you build a single archive from multiple files in parallel? As far as I know, HDFS does not allow multiple processes to write to one file concurrently, so any approach ends up funneling through a single writer anyway.
I would do this with Hive, as follows (a complete sketch of the whole sequence follows these steps):
Rename the subdirectories to Hive's partition naming convention: name=1_2015 and name=2_2015
CREATE EXTERNAL TABLE sending_table (all_content STRING)
PARTITIONED BY (name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY {a column delimiter that you know doesn't show up in any of the lines}
LOCATION '/dir1/dir2';
Make a second table that looks like the first, named "receiving", but with no partitions, and in a different directory.
Run this:
SET mapreduce.job.reduces=1; -- this guarantees it'll make one file
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
INSERT INTO TABLE receiving SELECT all_content FROM sending_table;
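Putting it together, a minimal sketch of the full sequence (the '|' delimiter and the /dir1/receiving target directory are placeholders for this example; MSCK REPAIR TABLE is added because Hive won't see the renamed directories as partitions until they're registered):

    # 1. Rename subdirectories to Hive's partition naming convention
    hdfs dfs -mv /dir1/dir2/Name1_2015 /dir1/dir2/name=1_2015
    hdfs dfs -mv /dir1/dir2/Name2_2015 /dir1/dir2/name=2_2015

    -- 2. Source table over the renamed directories
    CREATE EXTERNAL TABLE sending_table (all_content STRING)
    PARTITIONED BY (name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION '/dir1/dir2';
    MSCK REPAIR TABLE sending_table;

    -- 3. Unpartitioned target table in a different directory
    CREATE EXTERNAL TABLE receiving (all_content STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION '/dir1/receiving';

    -- 4. One LZO-compressed output file
    SET mapreduce.job.reduces=1;
    SET hive.exec.compress.output=true;
    SET mapreduce.output.fileoutputformat.compress=true;
    SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
    INSERT INTO TABLE receiving SELECT all_content FROM sending_table;

Two caveats: a plain SELECT can run map-only, in which case the reducer setting has no effect, so you may need to force a reduce phase or enable Hive's small-file merge settings (hive.merge.mapfiles / hive.merge.mapredfiles) to really get one file; and the output file gets a generated name like 000000_0.lzo, so a final hdfs dfs -mv would still be needed to match the /dir1/Name1_2015.lzo naming.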
You can try archiving all the individual LZO files into a HAR (Hadoop Archive) instead. I think merging everything into a single LZO file is needless overhead.
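For reference, a minimal sketch (the archive name dir2.har is arbitrary; hadoop archive runs a MapReduce job and leaves the stored files untouched, so they stay LZO-compressed):

    # Pack the subdirectories of /dir1/dir2 into /dir1/dir2.har
    hadoop archive -archiveName dir2.har -p /dir1/dir2 Name1_2015 Name2_2015 /dir1

    # The files remain readable through the har:// scheme
    hdfs dfs -ls har:///dir1/dir2.har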