
Is gzip format supported in Spark?

For a Big Data project, I'm planning to use Spark, which has some nice features like in-memory computation for repeated workloads. It can run on local files or on top of HDFS.

However, in the official documentation, I can't find any hint as to how to process gzipped files. In practice, it can be quite efficient to process .gz files instead of uncompressed files.

Is there a way to manually implement reading of gzipped files or is unzipping already automatically done when reading a .gz file?

ptikobj asked Apr 30 '13 14:04




1 Answer

From the Spark Scala Programming guide's section on "Hadoop Datasets":

Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by Hadoop (including your local file system, Amazon S3, Hypertable, HBase, etc). Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

Support for gzip input files should work the same as it does in Hadoop. For example, sc.textFile("myFile.gz") should automatically decompress and read gzip-compressed files (textFile() is actually implemented using Hadoop's TextInputFormat, which supports gzip-compressed files).
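As a minimal sketch of the call described above, written in PySpark rather than Scala: the helper names write_sample_gzip and first_lines are hypothetical, and the code assumes an already-created SparkContext is passed in as sc. The only Spark calls used are textFile() and take().

```python
import gzip


def write_sample_gzip(path, lines):
    """Write text lines to a gzip-compressed file (stdlib only, no Spark needed)."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


def first_lines(sc, path, n=5):
    """Return the first n lines of a text file through an existing SparkContext.

    The same call is used for plain and .gz paths: no explicit
    decompression step is written here, the input format handles it.
    """
    return sc.textFile(path).take(n)


# Hypothetical usage, assuming PySpark is installed:
#   from pyspark import SparkContext
#   sc = SparkContext("local[2]", "gzip-demo")
#   write_sample_gzip("myFile.gz", ["alpha", "beta", "gamma"])
#   print(first_lines(sc, "myFile.gz", 2))
```

Nothing in first_lines mentions gzip: the reading code stays the same whether or not the input is compressed.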

As mentioned by @nick-chammas in the comments:

note that if you call sc.textFile() on a gzipped file, Spark will give you an RDD with only 1 partition (as of 0.9.0). This is because gzipped files are not splittable. If you don't repartition the RDD somehow, any operations on that RDD will be limited to a single core
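One possible way to act on that comment, sketched in PySpark: the function name read_gzip_in_parallel and the num_partitions parameter are assumptions made for illustration, and sc is assumed to be an existing SparkContext. It relies only on textFile(), getNumPartitions(), and repartition().

```python
def read_gzip_in_parallel(sc, path, num_partitions):
    """Read a text file, then spread its lines over num_partitions partitions."""
    rdd = sc.textFile(path)
    # A single .gz file normally arrives as one partition, because the
    # compressed stream cannot be split.
    if rdd.getNumPartitions() < num_partitions:
        # repartition() shuffles the already-decompressed lines; the initial
        # read and decompression still happen in a single task.
        rdd = rdd.repartition(num_partitions)
    return rdd


# Hypothetical usage, assuming PySpark is installed:
#   from pyspark import SparkContext
#   sc = SparkContext("local[4]", "gzip-demo")
#   lines = read_gzip_in_parallel(sc, "myFile.gz", 4)
#   print(lines.getNumPartitions())
```

Repartitioning only helps the stages that come after the read. If the read itself is the bottleneck, splitting the input into several smaller .gz files before loading is a common alternative, since each file can then be read by its own task.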

Josh Rosen answered Oct 04 '22 16:10