 

Spark/Scala Opening Zipped CSV Files

I am new to Spark and Scala. We have ad event log files formatted as CSVs and then compressed using pkzip. I have seen many examples of how to decompress zipped files using Java, but how would I do this in Scala for Spark? Ultimately, we want to extract and load the data from each incoming file into an HBase destination table. Maybe this can be done with HadoopRDD? After this, we are going to introduce Spark Streaming to watch for these files.

Thanks, Ben

asked Feb 18 '14 by Ben



1 Answer

In Spark, provided your files have the correct filename suffix (e.g. .gz for gzipped) and the format is supported by org.apache.hadoop.io.compress.CompressionCodecFactory, you can simply use

sc.textFile(path)
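
For instance (a minimal sketch; the path and the comma delimiter are assumed, not from the answer), gzipped CSVs can be read and split without any explicit decompression step:

    val events = sc.textFile("hdfs:///data/ad-events/*.csv.gz") // decompressed transparently by the codec layer
    val fields = events.map(_.split(","))                       // one Array[String] per CSV record
    fields.take(3).foreach(f => println(f.mkString(" | ")))

Note that pkzip (.zip) archives are not among the codecs that CompressionCodecFactory registers by default, so they need separate handling.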

UPDATE: At the time of writing there is a bug in the Hadoop bzip2 library, which means that trying to read bzip2 files with Spark results in strange exceptions, usually ArrayIndexOutOfBoundsException.
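
For the zip archives in the question, one workaround (a sketch, not from the answer: it assumes sc.binaryFiles, which is available in newer Spark releases, and uses placeholder paths) is to read each file as binary and decompress it with the JDK's ZipInputStream:

    import java.util.zip.ZipInputStream
    import scala.io.Source

    // Read each archive whole; every zip is decompressed inside a single
    // task, so this fits many modest-sized archives best.
    val lines = sc.binaryFiles("hdfs:///data/ad-events/*.zip").flatMap {
      case (_, stream) =>
        val zis = new ZipInputStream(stream.open())
        Iterator.continually(zis.getNextEntry)
          .takeWhile(_ != null)
          // Materialize each entry's lines before moving to the next entry.
          .flatMap(_ => Source.fromInputStream(zis).getLines().toList)
    }

    // Split the CSV records; writing the rows to HBase (e.g. one Put per
    // row via the HBase client API) would follow from here.
    val rows = lines.map(_.split(","))

Because each archive is handled by one task, very large zips would bottleneck on a single executor; a custom Hadoop InputFormat for zip files is the usual alternative in that case.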

answered Sep 18 '22 by samthebest