I have a couple of hundred folders in S3, each containing a few thousand gzipped text files, and I'm trying to read them into a dataframe with spark.read.csv().
Some of the files have zero length, which results in the error:
java.io.EOFException: Unexpected end of input stream
Code:
df = spark.read.csv('s3n://my-bucket/folder*/logfiles*.log.gz', sep='\t', schema=schema)
I've tried setting the mode to DROPMALFORMED and reading the files with sc.textFile() instead, but no luck.
What's the best way to handle empty or broken gzip files?
Spark supports text files, SequenceFiles, and any other Hadoop InputFormat, and gzip input files are handled the same way they are in Hadoop: each .gz file is decompressed as a whole by the Hadoop codec. That is also why a zero-length file fails, since the codec reaches the end of the stream before finding a valid gzip header and throws the EOFException you're seeing.
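As a sanity check, well-formed gzipped text reads transparently without any extra options; a minimal sketch (the single-file path below is a made-up example):

# Hadoop's codecs decompress .gz input automatically based on the file extension
rdd = sc.textFile('s3n://my-bucket/folder1/logfiles-0001.log.gz')
rdd.take(5)  # fails with EOFException only if the file is empty or truncated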
Starting with Spark 2.1, you can ignore corrupt files by enabling the spark.sql.files.ignoreCorruptFiles option. Add this to your spark-submit or pyspark command:
--conf spark.sql.files.ignoreCorruptFiles=true
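The same option can also be set programmatically on an existing session instead of on the command line; a minimal sketch, reusing the spark session and schema from the question:

# Skip empty or truncated gzip files instead of failing the whole job
spark.conf.set('spark.sql.files.ignoreCorruptFiles', 'true')
df = spark.read.csv('s3n://my-bucket/folder*/logfiles*.log.gz', sep='\t', schema=schema)

Spark will then skip the contents of any file it cannot read and continue with the remaining files.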