
How to unzip .gz files into a new directory in Hadoop?

Tags:

gzip

hadoop

hdfs

I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files to a new folder in HDFS. How should I do this?

Asked Jan 03 '16 by Monica



1 Answer

I can think of 3 different ways to achieve this:

  1. Using the Linux command line

    The following command worked for me:

    hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
    

    My gzipped file is Links.txt.gz, and the decompressed output gets stored in /tmp/unzipped/Links.txt.
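    Since the question is about a whole folder of .gz files, you can wrap that same pipeline in a small shell loop. Here is a minimal sketch, assuming a hypothetical source folder /tmp/gz_input and target folder /tmp/unzipped, and file names without spaces:

    # Create the target folder, then stream-decompress each .gz file into it.
    hadoop fs -mkdir -p /tmp/unzipped
    for f in $(hadoop fs -ls '/tmp/gz_input/*.gz' | awk '{print $NF}' | grep '\.gz$'); do
        name=$(basename "$f" .gz)                 # e.g. Links.txt.gz -> Links.txt
        hadoop fs -cat "$f" | gzip -d | hadoop fs -put - /tmp/unzipped/"$name"
    done

    Each file is streamed out of HDFS, decompressed by gzip on the local machine, and streamed back into HDFS, so nothing is written to local disk.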

  2. Using a Java program

    In Hadoop: The Definitive Guide, there is a section on Codecs. In that section, there is a program that decompresses a file using CompressionCodecFactory. That code is reproduced below:

    package com.myorg.hadooptests;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;
    
    public class FileDecompressor {
        public static void main(String[] args) throws Exception {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            Path inputPath = new Path(uri);
            // Infer the compression codec (gzip, bzip2, ...) from the file extension.
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(inputPath);
            if (codec == null) {
                System.err.println("No codec found for " + uri);
                System.exit(1);
            }
            // Strip the codec's extension to build the output path,
            // e.g. /tmp/Links.txt.gz -> /tmp/Links.txt
            String outputUri =
                    CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
            InputStream in = null;
            OutputStream out = null;
            try {
                // createInputStream() decompresses on the fly; copy the
                // decompressed bytes into a new, uncompressed HDFS file.
                in = codec.createInputStream(fs.open(inputPath));
                out = fs.create(new Path(outputUri));
                IOUtils.copyBytes(in, out, conf);
            } finally {
                IOUtils.closeStream(in);
                IOUtils.closeStream(out);
            }
        }
    }
    

    This code takes the .gz file path as input.
    You can execute this as:

    FileDecompressor <gzipped file name>
    

    For example, when I executed it for my gzipped file:

    FileDecompressor /tmp/Links.txt.gz
    

    I got the unzipped file at location: /tmp/Links.txt

    It stores the unzipped file in the same folder as the input. So, for your use case, you need to modify this code to take 2 input parameters: <input file path> and <output folder>.

    Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.
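    For example, a thin wrapper might look like the following sketch (the jar name hadooptests.jar, the source folder /tmp/gz_input, and the two-argument version of FileDecompressor are all assumptions, not part of the original program above):

    # Assumes FileDecompressor has been modified to accept <input file> <output folder>
    # and has been packaged into the hypothetical jar hadooptests.jar.
    for f in $(hadoop fs -ls '/tmp/gz_input/*.gz' | awk '{print $NF}' | grep '\.gz$'); do
        hadoop jar hadooptests.jar com.myorg.hadooptests.FileDecompressor "$f" /tmp/unzipped
    done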

  3. Using a Pig script

    You can write a simple Pig script to achieve this.

    I wrote the following script, which works:

    A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
    STORE A INTO '/tmp/tmp_unzipped/' USING PigStorage();
    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/
    

    When you run this script, the unzipped contents are stored in a temporary folder: /tmp/tmp_unzipped. This folder will contain:

    /tmp/tmp_unzipped/_SUCCESS
    /tmp/tmp_unzipped/part-m-00000
    

    The part-m-00000 file contains the unzipped contents.

    Hence, the last two lines of the script explicitly rename it and then delete the /tmp/tmp_unzipped folder:

    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/
    

    So, if you use this Pig script, you just need to take care of parameterizing the file names (Links.txt.gz and Links.txt).

    Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.
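    A sketch of such a wrapper is shown below. It assumes the parameterized script is saved as unzip.pig with hypothetical $input and $output placeholders (substituted by Pig's -param option), and it reuses the hypothetical /tmp/gz_input and /tmp/unzipped folders from the earlier sketches:

    # unzip.pig (hypothetical, parameterized version of the script above):
    #   A = LOAD '$input' USING PigStorage();
    #   STORE A INTO '$output' USING PigStorage();
    hadoop fs -mkdir -p /tmp/unzipped
    for f in $(hadoop fs -ls '/tmp/gz_input/*.gz' | awk '{print $NF}' | grep '\.gz$'); do
        name=$(basename "$f" .gz)
        pig -param input="$f" -param output=/tmp/tmp_unzipped unzip.pig
        hadoop fs -mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/"$name"
        hadoop fs -rm -r /tmp/tmp_unzipped
    done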

Answered Sep 20 '22 by Manjunath Ballur