For a Big Data project, I'm planning to use Spark, which has some nice features such as in-memory computation for repeated workloads. It can run on local files or on top of HDFS. However, I can't find any hint in the official documentation on how to process gzipped files. In practice, it can be quite efficient to process .gz files instead of uncompressed files. Is there a way to manually implement reading of gzipped files, or is decompression already done automatically when reading a .gz file?


Is gzip format supported in Spark?

1 Answer

From the Spark Scala Programming guide's section on "Hadoop Datasets":

Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by Hadoop (including your local file system, Amazon S3, Hypertable, HBase, etc). Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

Support for gzip input files should work the same as it does in Hadoop. For example, sc.textFile("myFile.gz") should automatically decompress and read gzip-compressed files (textFile() is actually implemented using Hadoop's TextInputFormat, which supports gzip-compressed files).
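A minimal sketch of the read described above, assuming a local Spark installation on the classpath. The object name GzipTextFileSketch, the helpers writeSample and readLines, the temporary sample file, and the local[2] master are hypothetical choices for illustration only. The sample file is given a .gz suffix on the assumption that Hadoop selects the decompression codec from the file extension.

```scala
import java.io.{File, FileOutputStream, OutputStreamWriter}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPOutputStream

import org.apache.spark.{SparkConf, SparkContext}

object GzipTextFileSketch {
  // Writes a small gzip-compressed text file so the sketch has input to read.
  def writeSample(file: File, lines: Seq[String]): Unit = {
    val out = new OutputStreamWriter(
      new GZIPOutputStream(new FileOutputStream(file)), StandardCharsets.UTF_8)
    try lines.foreach(line => out.write(line + "\n")) finally out.close()
  }

  // Reads the file with textFile(); no explicit decompression step is written here.
  def readLines(sc: SparkContext, path: String): Array[String] =
    sc.textFile(path).collect()

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input with a .gz suffix.
    val sample = File.createTempFile("sample", ".gz")
    sample.deleteOnExit()
    writeSample(sample, Seq("alpha", "beta", "gamma"))

    // Local master chosen only for this sketch.
    val conf = new SparkConf().setAppName("gzip-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      readLines(sc, sample.toURI.toString).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

If decompression works as described, the printed lines are the plain-text lines that were written, not compressed bytes.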

As mentioned by @nick-chammas in the comments:

Note that if you call sc.textFile() on a gzipped file, Spark will give you an RDD with only 1 partition (as of 0.9.0). This is because gzipped files are not splittable. If you don't repartition the RDD somehow, any operations on that RDD will be limited to a single core.
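One way to act on that comment is to check the partition count after the read and then call repartition() before doing heavier work. The following is a sketch under assumptions, not a tuned setup: the object name GzipRepartitionSketch, the helper partitionCounts, the default path myFile.gz, the target of 4 partitions, and the local[4] master are all illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GzipRepartitionSketch {
  // Returns (partitions right after the read, partitions after repartitioning).
  def partitionCounts(sc: SparkContext, path: String, target: Int): (Int, Int) = {
    val raw = sc.textFile(path)
    // repartition() shuffles the data into `target` partitions.
    val spread = raw.repartition(target)
    (raw.partitions.length, spread.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    // Path is taken from the first argument; "myFile.gz" is a placeholder default.
    val path = if (args.nonEmpty) args(0) else "myFile.gz"

    // Local master chosen only for this sketch.
    val conf = new SparkConf().setAppName("gzip-repartition-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = partitionCounts(sc, path, 4)
      println(s"partitions after read: $before")
      println(s"partitions after repartition: $after")
    } finally {
      sc.stop()
    }
  }
}
```

The decompression itself still happens in a single task, since the file cannot be split. Repartitioning only spreads the work that comes after the read, and it costs a shuffle, so it tends to pay off when the downstream processing is much heavier than the read.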


answered Oct 04 '22 16:10

Josh Rosen


Tags:

java

gzip

scala

apache-spark

mapreduce

ptikobj
