 

Merging multiple LZO compressed files on HDFS

Let's say I have this structure on HDFS:

/dir1
    /dir2
        /Name1_2015
            file1.lzo
            file2.lzo
            file3.lzo
        /Name2_2015
            file1.lzo
            file2.lzo

    Name1_2015.lzo

I would like to merge the files of each directory under 'dir2' and append the result to the corresponding file /dir1/DirName.lzo.

For example, for /dir1/dir2/Name1_2015, I want to merge file1.lzo, file2.lzo and file3.lzo, and append the result to /dir1/Name1_2015.lzo.

Each file is LZO compressed.

How can I do it?

Thanks

asked Jul 24 '15 by guillaume

3 Answers

If you don't care much about parallelism, here's a bash loop (note the listing has to start from /dir1/dir2, matching the layout in the question):

for d in $(hdfs dfs -ls /dir1/dir2 | grep -oP '(?<=/)[^/]+$'); do
    # stream all parts of one directory, decompress, recompress into a single LZO file
    hdfs dfs -cat "/dir1/dir2/$d"/*.lzo | lzop -d | lzop | hdfs dfs -put - "/dir1/$d.lzo"
done

You can extract all the files in parallel using MapReduce. But how do you build one archive from multiple files in parallel? As far as I know, it is not possible to write to a single HDFS file from multiple processes concurrently, so we end up with a single-node solution anyway.
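The decompress-and-recompress pattern in the loop above can be tried out locally. The sketch below substitutes gzip for lzop and a temp directory for HDFS; both substitutions are assumptions purely so the pattern can run without a cluster:

```shell
# Local sketch of the merge: gzip stands in for lzop, a temp
# directory stands in for HDFS (both are stand-ins, not the real tools).
set -e
work=$(mktemp -d)
mkdir -p "$work/Name1_2015"
printf 'line1\n' | gzip > "$work/Name1_2015/file1.gz"
printf 'line2\n' | gzip > "$work/Name1_2015/file2.gz"

for d in "$work"/*/ ; do
    name=$(basename "$d")
    # decompress every part in order, recompress the combined stream,
    # mirroring: hdfs dfs -cat ... | lzop -d | lzop | hdfs dfs -put -
    cat "$d"*.gz | gzip -d | gzip > "$work/$name.gz"
done

gzip -dc "$work/Name1_2015.gz"   # prints line1 then line2
```

On the cluster, each merged output file still has exactly one writer, so different directories could at least be processed by concurrent instances of this loop.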

answered Oct 21 '22 by Mikhail Golubtsov


I would do this with Hive, as follows:

  1. Rename the subdirectories to name=1_2015 and name=2_2015, so Hive can treat them as partitions of a name column.

  2. Create an external table partitioned on that column:

     CREATE EXTERNAL TABLE sending_table (all_content string)
     PARTITIONED BY (name string)
     ROW FORMAT DELIMITED FIELDS TERMINATED BY {a column delimiter that you know doesn't show up in any of the lines}
     LOCATION "/dir1/dir2";

  3. Make a second table that looks like the first, named "receiving", but with no partitions, and in a different directory.

  4. Run this:

    SET mapreduce.job.reduces=1;  -- this guarantees it'll make one file
    SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
    SET hive.exec.compress.output=true;
    SET mapreduce.output.fileoutputformat.compress=true;

    INSERT INTO TABLE receiving SELECT all_content FROM sending_table;
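One step the list glosses over: a newly created external table doesn't know about the name=... subdirectories until the partitions are registered, so the INSERT would otherwise copy zero rows. A sketch, with the partition value following the renaming in step 1:

```sql
-- register all partitions found under the table's LOCATION
MSCK REPAIR TABLE sending_table;

-- or add them one at a time
ALTER TABLE sending_table ADD PARTITION (name='1_2015');
```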

answered Oct 21 '22 by Robert Rapplean


You can try to archive all the individual LZO files into a HAR (Hadoop Archive). I think merging everything into a single LZO file is more overhead than it is worth.
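For reference, building a HAR over one of the subdirectories might look like the following. The paths come from the question's layout; the archive name is an assumption, and the command launches a MapReduce job, so it needs a live cluster:

```shell
# pack /dir1/dir2/Name1_2015 into one HAR file under /dir1
hadoop archive -archiveName Name1_2015.har -p /dir1/dir2 Name1_2015 /dir1

# the original small files stay addressable through the har:// scheme
hdfs dfs -ls har:///dir1/Name1_2015.har
```

The files stay individually compressed inside the archive; a HAR reduces NameNode pressure from many small files rather than producing one merged LZO stream.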

answered Oct 21 '22 by Karthik