
How to unzip .gz files into a new directory in Hadoop?

Tags:

gzip

hadoop

hdfs

I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files to a new folder in HDFS. How should I do this?

Asked Jan 03 '16 by Monica



1 Answer

I can think of 3 different ways to achieve this:

  1. Using the Linux command line

    The following command worked for me:

    hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
    

    My gzipped file is Links.txt.gz, and the decompressed output gets stored in /tmp/unzipped/Links.txt.
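    Since the question is about a whole folder of .gz files, you can wrap that same pipeline in a small shell loop. Here is a minimal sketch, assuming a hypothetical source folder /tmp/gz_input and target folder /tmp/unzipped, and file names without spaces:

    # Create the target folder, then stream-decompress each .gz file into it.
    hadoop fs -mkdir -p /tmp/unzipped
    for f in $(hadoop fs -ls '/tmp/gz_input/*.gz' | awk '{print $NF}' | grep '\.gz$'); do
        name=$(basename "$f" .gz)                 # e.g. Links.txt.gz -> Links.txt
        hadoop fs -cat "$f" | gzip -d | hadoop fs -put - /tmp/unzipped/"$name"
    done

    Each file is streamed out of HDFS, decompressed by gzip on the local machine, and streamed back into HDFS, so nothing is written to local disk.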

  2. Using a Java program

    In Hadoop: The Definitive Guide, there is a section on Codecs. In that section, there is a program that decompresses a file using CompressionCodecFactory. That code is reproduced below:

    package com.myorg.hadooptests;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;
    
    public class FileDecompressor {
        public static void main(String[] args) throws Exception {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            Path inputPath = new Path(uri);
            // Infer the compression codec (gzip, bzip2, ...) from the file extension.
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(inputPath);
            if (codec == null) {
                System.err.println("No codec found for " + uri);
                System.exit(1);
            }
            // Strip the codec's extension to build the output path,
            // e.g. /tmp/Links.txt.gz -> /tmp/Links.txt
            String outputUri =
                    CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
            InputStream in = null;
            OutputStream out = null;
            try {
                // createInputStream() decompresses on the fly; copy the
                // decompressed bytes into a new, uncompressed HDFS file.
                in = codec.createInputStream(fs.open(inputPath));
                out = fs.create(new Path(outputUri));
                IOUtils.copyBytes(in, out, conf);
            } finally {
                IOUtils.closeStream(in);
                IOUtils.closeStream(out);
            }
        }
    }
    

    This code takes the .gz file path as input.
    You can execute this as:

    FileDecompressor <gzipped file name>
    

    For example, when I executed it for my gzipped file:

    FileDecompressor /tmp/Links.txt.gz
    

    I got the unzipped file at location: /tmp/Links.txt

    It stores the unzipped file in the same folder as the input. So, for your use case, you need to modify this code to take 2 input parameters: <input file path> and <output folder>.

    Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.
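    For example, a thin wrapper might look like the following sketch (the jar name hadooptests.jar, the source folder /tmp/gz_input, and the two-argument version of FileDecompressor are all assumptions, not part of the original program above):

    # Assumes FileDecompressor has been modified to accept <input file> <output folder>
    # and has been packaged into the hypothetical jar hadooptests.jar.
    for f in $(hadoop fs -ls '/tmp/gz_input/*.gz' | awk '{print $NF}' | grep '\.gz$'); do
        hadoop jar hadooptests.jar com.myorg.hadooptests.FileDecompressor "$f" /tmp/unzipped
    done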

  3. Using a Pig script

    You can write a simple Pig script to achieve this.

    I wrote the following script, which works:

    A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
    STORE A INTO '/tmp/tmp_unzipped/' USING PigStorage();
    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/
    

    When you run this script, the unzipped contents are stored in a temporary folder: /tmp/tmp_unzipped. This folder will contain:

    /tmp/tmp_unzipped/_SUCCESS
    /tmp/tmp_unzipped/part-m-00000
    

    The part-m-00000 file contains the unzipped contents.

    Hence, the last two lines of the script explicitly rename it and then delete the /tmp/tmp_unzipped folder:

    mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
    rm /tmp/tmp_unzipped/
    

    So, if you use this Pig script, you just need to take care of parameterizing the file names (Links.txt.gz and Links.txt).

    Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.
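    A sketch of such a wrapper is shown below. It assumes the parameterized script is saved as unzip.pig with hypothetical $input and $output placeholders (substituted by Pig's -param option), and it reuses the hypothetical /tmp/gz_input and /tmp/unzipped folders from the earlier sketches:

    # unzip.pig (hypothetical, parameterized version of the script above):
    #   A = LOAD '$input' USING PigStorage();
    #   STORE A INTO '$output' USING PigStorage();
    hadoop fs -mkdir -p /tmp/unzipped
    for f in $(hadoop fs -ls '/tmp/gz_input/*.gz' | awk '{print $NF}' | grep '\.gz$'); do
        name=$(basename "$f" .gz)
        pig -param input="$f" -param output=/tmp/tmp_unzipped unzip.pig
        hadoop fs -mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/"$name"
        hadoop fs -rm -r /tmp/tmp_unzipped
    done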

Answered Sep 20 '22 by Manjunath Ballur