 

How to read a ".gz" compressed file using Spark DF or DS?

I have a compressed file in .gz format. Is it possible to read the file directly using Spark DF/DS?

Details: the file is a CSV with tab delimiters.

asked Mar 26 '18 by prady

People also ask

Why can't I read a GZ file in Spark?

Because any unknown extension defaults to plain text. The reason you can't read a file ending in .gz.tmp is that Spark tries to match the file extension against its registered compression codecs, and no codec handles the extension .gz.tmp!
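
If you do need to read files with such a non-standard extension, one known workaround is to register a custom Hadoop codec that claims that extension. A minimal Scala sketch, assuming Spark 2.x on the JVM; the class name TmpGzipCodec is hypothetical:

import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.sql.SparkSession

// Hypothetical codec: plain gzip decompression, but matched to the
// ".gz.tmp" extension instead of the default ".gz".
class TmpGzipCodec extends GzipCodec {
  override def getDefaultExtension: String = ".gz.tmp"
}

// Register the codec so Hadoop's CompressionCodecFactory can find it,
// then read as usual.
val spark = SparkSession.builder()
  .config("spark.hadoop.io.compression.codecs", classOf[TmpGzipCodec].getName)
  .getOrCreate()

val df = spark.read.option("sep", "\t").csv("file.csv.gz.tmp")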

Should I include .z or .gz in the filename?

It doesn't matter whether you include the .z or .gz in the filename. If you don't have enough disk space to uncompress the file, or you only want to see the contents once while keeping the file compressed, you can send the contents of the file to standard output (usually your terminal) using the zcat command.

How to read a .gz file in Linux?

Use gunzip: in a terminal, type gunzip, a space, and the file name, then press Enter. For example, gunzip example.gz unpacks the file to example. To view the contents without decompressing to disk, use zcat example.gz (a JVM equivalent is sketched below).
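
For completeness, the same zcat-style streaming read can be done from code. A minimal Scala sketch using only the standard library; the file name is an illustrative assumption:

import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source

// Stream the decompressed contents to stdout without writing a
// decompressed copy to disk.
Source.fromInputStream(new GZIPInputStream(new FileInputStream("example.csv.gz")))
  .getLines()
  .foreach(println)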



1 Answer

Reading a compressed CSV is done in the same way as reading an uncompressed CSV file. For Spark 2.0+ it can be done as follows in Scala (note the extra option for the tab delimiter):

// Spark picks the gzip codec from the .gz extension; "sep" sets the tab delimiter
val df = spark.read.option("sep", "\t").csv("file.csv.gz")

PySpark:

# The same read in PySpark; the codec is again inferred from the extension
df = spark.read.csv("file.csv.gz", sep='\t')

The only extra consideration is that gzip files are not splittable, so Spark has to read the whole file with a single core, which slows the initial read down. Once the read is done, the data can be shuffled to increase parallelism.
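
For example, a repartition right after the read spreads the rows across the cluster before any expensive transformations run. A short sketch; the partition count is an illustrative assumption to be tuned to your cluster:

// A single gzip file is read into a single partition; redistribute it
// so later stages can run in parallel. 8 is an arbitrary example value.
val parallelDf = df.repartition(8)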

answered Sep 20 '22 by Shaido