I'm reading a table using spark.sql() and then trying to print the count.
However, some of the underlying files were removed from HDFS directly.
Spark fails with the error below:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/some path.../data
Hive is able to give me the count for the same query without any error. The table is an external, partitioned table.
I want to ignore the missing files and prevent my Spark job from failing. I searched the internet and tried setting the config parameters below while creating the Spark session, but with no luck.
SparkSession.builder
  .config("spark.sql.hive.verifyPartitionPath", "false")
  .config("spark.sql.files.ignoreMissingFiles", "true")
  .config("spark.sql.files.ignoreCorruptFiles", "true")
  .enableHiveSupport()
  .getOrCreate()
I referred to https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-properties.html for the above config parameters.
val sql = "SELECT count(*) FROM db.table WHERE date=20190710"
val df = spark.sql(sql)
println(df.first.getLong(0)) // note: df.count would return 1 here (the number of result rows), not the table count
I expect the Spark code to complete successfully without a FileNotFoundException even if some of the files listed in the partition metadata are missing.
I'm wondering why spark.sql.files.ignoreMissingFiles has no effect.
The Spark version is 2.2.0.cloudera1. Kindly suggest. Thanks in advance.
Setting the config parameter below resolved the issue:
For Hive:
mapred.input.dir.recursive=true
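For example, in the Hive CLI or Beeline this can be set for the current session before running the query (reusing the query from the question):

```sql
SET mapred.input.dir.recursive=true;
SELECT count(*) FROM db.table WHERE date=20190710;
```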
For Spark Session:
SparkSession.builder
  .config("mapred.input.dir.recursive", "true")
  .enableHiveSupport()
  .getOrCreate()
On further analysis, I found that the location registered for each partition in the table is a parent directory: under it there are many subfolders, and the actual data files sit inside those subfolders. Spark therefore needs recursive input-directory discovery turned on to read the data.
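This layout can be reproduced with plain Scala (no Spark needed) to show why a flat directory listing finds no data files while a recursive walk does; the directory and file names below are made up for illustration:

```scala
import java.nio.file.{Files, Path}

object RecursiveDemo {
  // Builds a temp layout mimicking the table's nested partition directory
  // and returns (files found by a flat listing, files found recursively).
  def run(): (Long, Long) = {
    // Registered partition location: date=20190710_* (hypothetical names)
    val partition: Path = Files.createTempDirectory("date=20190710_")
    val sub = Files.createDirectories(partition.resolve("batch_001"))
    Files.createFile(sub.resolve("part-00000"))

    // Flat listing (the default behaviour): only the subfolder is seen,
    // so no regular files are found at the partition location itself.
    val flat = Files.list(partition).filter(p => Files.isRegularFile(p)).count()

    // Recursive walk (what mapred.input.dir.recursive enables): the data
    // file inside the subfolder is picked up.
    val recursive = Files.walk(partition).filter(p => Files.isRegularFile(p)).count()

    (flat, recursive)
  }

  def main(args: Array[String]): Unit = {
    val (flat, recursive) = run()
    println(s"flat: $flat, recursive: $recursive") // flat: 0, recursive: 1
  }
}
```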