Dealing with a large gzipped file in Spark

I have a large (about 85 GB compressed) gzipped file on S3 that I am trying to process with Spark on AWS EMR (right now with an m4.xlarge master instance and two m4.10xlarge core instances, each with a 100 GB EBS volume). I am aware that gzip is a non-splittable file format, and I've seen it suggested that one should repartition the compressed file because Spark initially gives an RDD with one partition. However, after running

scala> val raw = spark.read.format("com.databricks.spark.csv").
     | options(Map("delimiter" -> "\\t", "codec" -> "org.apache.hadoop.io.compress.GzipCodec")).
     | load("s3://path/to/file.gz").
     | repartition(sc.defaultParallelism * 3)
raw: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c0: string, _c1: string ... 48 more fields]
scala> raw.count()

and taking a look at the Spark application UI, I still see only one active executor (the other 14 are dead) with one task, and the job never finishes (or at least I've not waited long enough for it to).

What is going on here? Can someone help me understand how Spark is working in this example?
Should I be using a different cluster configuration?
Unfortunately, I have no control over the mode of compression, but is there an alternative way of dealing with such a file?

432

asked Nov 08 '16 17:11

user4601931

1 Answer

If the file format is not splittable, then there's no way to avoid reading the file in its entirety on one core. In order to parallelize work, you have to know how to assign chunks of work to different machines. In the gzip case, suppose you divide the file into 128 MB chunks. To decompress the nth chunk you need state from the (n-1)th chunk, which in turn depends on the (n-2)th chunk, and so on down to the first. That is also why the repartition in your snippet doesn't help with the read: it is a shuffle that runs only after a single task has decompressed the whole file.
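To see this from the Spark side, here is a minimal sketch that compares the partition count right after the read with the count after a repartition. The object and method names (`GzipPartitionCheck`, `partitionCounts`) are made up for illustration, and it assumes Spark 2.x or later with the built-in CSV reader rather than the `com.databricks.spark.csv` package.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of Spark.
object GzipPartitionCheck {
  /** Returns (partitions right after the read, partitions after repartition). */
  def partitionCounts(spark: SparkSession, path: String, target: Int): (Int, Int) = {
    // The gzip codec is picked up from the .gz extension of the path.
    val raw = spark.read.option("delimiter", "\t").csv(path)

    // repartition adds a shuffle after the scan; the scan itself is unchanged.
    val repartitioned = raw.repartition(target)

    (raw.rdd.getNumPartitions, repartitioned.rdd.getNumPartitions)
  }
}
```

In the spark-shell this could be called as `GzipPartitionCheck.partitionCounts(spark, "s3://path/to/file.gz", sc.defaultParallelism * 3)`. For a single gzip file the first number should be 1: the stage that reads the file has exactly one task, and only the stage after the shuffle gets the requested number of tasks.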

If you want to parallelize, you need to make this file splittable. One way is to decompress it and process it uncompressed; another is to decompress it, split it into several files (one for each parallel task you want), and gzip each file.
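The second option (decompress, split, re-gzip) can be sketched as a single streaming pass using only the JDK's `java.util.zip` classes. This is an illustration under a few assumptions, not a tuned tool: `GzipSplitter` and `linesPerPart` are made-up names, the file is assumed to have been copied to a local path first, the text is assumed to be UTF-8, and splitting on line boundaries assumes no record spans multiple lines.

```scala
import java.io.{BufferedReader, BufferedWriter, InputStreamReader, OutputStreamWriter}
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical helper for illustration.
object GzipSplitter {
  /** Streams the gzip file `input` and rewrites it as gzip parts of at most
    * `linesPerPart` lines each. Returns the part files in order. */
  def split(input: Path, outDir: Path, linesPerPart: Long): Vector[Path] = {
    require(linesPerPart > 0, "linesPerPart must be positive")
    Files.createDirectories(outDir)
    val reader = new BufferedReader(
      new InputStreamReader(new GZIPInputStream(Files.newInputStream(input)), UTF_8))
    var parts = Vector.empty[Path]
    var writer: BufferedWriter = null
    var linesInPart = 0L
    try {
      var line = reader.readLine()
      while (line != null) {
        // Start a new part on the first line and whenever the current part is full.
        if (writer == null || linesInPart == linesPerPart) {
          if (writer != null) writer.close()
          val part = outDir.resolve(f"part-${parts.size}%05d.gz")
          writer = new BufferedWriter(
            new OutputStreamWriter(new GZIPOutputStream(Files.newOutputStream(part)), UTF_8))
          parts :+= part
          linesInPart = 0L
        }
        writer.write(line)
        writer.write("\n")
        linesInPart += 1
        line = reader.readLine()
      }
    } finally {
      if (writer != null) writer.close()
      reader.close()
    }
    parts
  }
}
```

The pass itself is still sequential, since that is the part gzip forces, but it only has to happen once. After the parts are uploaded, pointing Spark at the directory of `.gz` files lets it hand different files to different tasks.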

173

answered Sep 28 '22 18:09

Tim

Tags:

gzip

apache-spark

amazon-emr
