
Spark - how to skip or ignore empty gzip files when reading

I have a couple of hundred folders in S3, each containing thousands of gzipped text files, and I'm trying to read them into a DataFrame with spark.read.csv().

Among the files, there are some with zero length, resulting in the error:

java.io.EOFException: Unexpected end of input stream

Code:

df = spark.read.csv('s3n://my-bucket/folder*/logfiles*.log.gz', sep='\t', schema=schema)

I've tried setting mode to DROPMALFORMED and reading the files with sc.textFile(), but no luck.

What's the best way to handle empty or broken gzip files?

asked Apr 05 '17 by antti


1 Answer

Starting with Spark 2.1, you can ignore corrupt files (including zero-length gzip files) by enabling the spark.sql.files.ignoreCorruptFiles option. Add this to your spark-submit or pyspark command:

--conf spark.sql.files.ignoreCorruptFiles=true
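
You can also set it at runtime on an existing SparkSession. A minimal sketch, re-running the read from the question (the bucket path is the asker's; the schema here is a hypothetical placeholder):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("read-gz-logs").getOrCreate()

# Skip zero-length and otherwise corrupt gzip files instead of
# failing the whole job with an EOFException
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# Hypothetical two-column schema; substitute the real one for your logs
schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("message", StringType(), True),
])

df = spark.read.csv("s3n://my-bucket/folder*/logfiles*.log.gz",
                    sep="\t", schema=schema)

Note that with this option enabled, Spark skips any file it cannot read rather than aborting, so genuinely truncated archives are silently dropped along with the empty ones; check your row counts if that matters.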

answered Sep 21 '22 by Radhwane Chebaane