
Spark - how to skip or ignore empty gzip files when reading

I have a couple of hundred folders in S3, each containing thousands of gzipped text files, and I'm trying to read them into a DataFrame with spark.read.csv().

Among the files, there are some with zero length, resulting in the error:

java.io.EOFException: Unexpected end of input stream

Code:

df = spark.read.csv('s3n://my-bucket/folder*/logfiles*.log.gz', sep='\t', schema=schema)

I've tried setting mode to DROPMALFORMED and reading the files with sc.textFile(), but no luck.

What's the best way to handle empty or broken gzip files?

asked Apr 05 '17 by antti


1 Answer

Starting with Spark 2.1, you can ignore corrupt files (including zero-length gzip files) by enabling the spark.sql.files.ignoreCorruptFiles option. Add this to your spark-submit or pyspark command:

--conf spark.sql.files.ignoreCorruptFiles=true
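
You can also set it at runtime on an existing SparkSession. A minimal sketch, re-running the read from the question (the bucket path is the asker's; the schema here is a hypothetical placeholder):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("read-gz-logs").getOrCreate()

# Skip zero-length and otherwise corrupt gzip files instead of
# failing the whole job with an EOFException
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# Hypothetical two-column schema; substitute the real one for your logs
schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("message", StringType(), True),
])

df = spark.read.csv("s3n://my-bucket/folder*/logfiles*.log.gz",
                    sep="\t", schema=schema)

Note that with this option enabled, Spark skips any file it cannot read rather than aborting, so genuinely truncated archives are silently dropped along with the empty ones; check your row counts if that matters.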

answered Sep 21 '22 by Radhwane Chebaane