I have zip files that I would like to open 'through' Spark. I can open .gz files with no problem thanks to Hadoop's native codec support, but I am unable to do so with .zip files.
Is there an easy way to read a zip file in your Spark code? I've also searched for zip codec implementations to add to the CompressionCodecFactory, but have been unsuccessful so far.
The following notebooks show how to read zip files. After you download a zip file to a temp directory, you can invoke the Azure Databricks %sh magic command to unzip the file. For the sample file used in the notebooks, the tail step removes a comment line from the unzipped file.
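As a rough illustration only (not taken from the notebooks), here is a pure-Python sketch of that workflow; the URL and file names are hypothetical, and the notebooks do these steps with %sh shell commands instead:

import io
import urllib.request
import zipfile

# Hypothetical URL and paths; substitute your own.
url = "https://example.com/data/sample.zip"
with urllib.request.urlopen(url) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))
archive.extractall("/tmp/sample")

# Equivalent of the `tail` step: drop the leading comment line.
with open("/tmp/sample/sample.csv") as f:
    lines = f.readlines()[1:]
with open("/tmp/sample/sample_clean.csv", "w") as f:
    f.writelines(lines)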
None of the answers here used Python, and I recently had to read zips in PySpark. While searching for how to do that, I came across this question, so hopefully this will help others.
import io
import zipfile

def zip_extract(x):
    # x is a (path, bytes) pair as produced by sc.binaryFiles
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = file_obj.namelist()
    # Map each archived file's name to its raw contents
    return dict(zip(files, [file_obj.open(name).read() for name in files]))

zips = sc.binaryFiles("hdfs:/Testing/*.zip")
files_data = zips.map(zip_extract).collect()
In the code above I return a dictionary with each filename in the zip as a key and the data in that file as the value; you can change it however you want to suit your purposes.
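For example, here is a hypothetical variant that keeps the data distributed and flattens each archive into per-file (filename, contents) pairs instead of collecting whole dictionaries to the driver:

# Hypothetical variant: one (filename, contents) record per archived file,
# left as a distributed RDD rather than collected locally.
files_rdd = zips.flatMap(lambda x: zip_extract(x).items())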
@user3591785 pointed me in the correct direction, so I marked his answer as correct.
For a bit more detail: I searched for "ZipFileInputFormat Hadoop" and came across this link: http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/
Taking the ZipFileInputFormat and its helper ZipFileRecordReader class, I was able to get Spark to open and read the zip file perfectly.
rdd1 = sc.newAPIHadoopFile(
    "/Users/myname/data/compressed/target_file.ZIP",
    ZipFileInputFormat.class,
    Text.class,
    Text.class,
    new Job().getConfiguration());
The result was a map with one element: the file name as key and the file contents as the value. I needed to transform this into a JavaPairRDD. I'm sure you could replace Text with BytesWritable if you want, and replace the ArrayList with something else, but my goal was to first get something running.
JavaPairRDD<String, String> rdd2 = rdd1.flatMapToPair(
    new PairFlatMapFunction<Tuple2<Text, Text>, String, String>() {
        @Override
        public Iterable<Tuple2<String, String>> call(Tuple2<Text, Text> textTextTuple2) throws Exception {
            List<Tuple2<String, String>> newList = new ArrayList<Tuple2<String, String>>();
            // The value holds the uncompressed file contents; read it line by line
            InputStream is = new ByteArrayInputStream(textTextTuple2._2.getBytes());
            BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
            String line;
            while ((line = br.readLine()) != null) {
                // Key each line by its first tab-separated field
                Tuple2<String, String> newTuple = new Tuple2<String, String>(line.split("\\t")[0], line);
                newList.add(newTuple);
            }
            return newList;
        }
    });
Please try the code below, using the sparkContext.newAPIHadoopRDD API:

sparkContext.newAPIHadoopRDD(
    hadoopConf,
    InputFormat.class,
    ImmutableBytesWritable.class,
    Result.class)
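For reference, a minimal PySpark sketch of the same call; note that ImmutableBytesWritable and Result are HBase classes, so the class names below are placeholders carried over from the snippet above rather than zip-specific input formats:

# Sketch of sc.newAPIHadoopRDD in PySpark; class names are the HBase
# ones from the answer above, used here only as placeholders.
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    conf={"hbase.zookeeper.quorum": "localhost"})  # hypothetical config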