I'm reading a table using spark.sql() and then trying to print the count.
However, some of the underlying files were removed from HDFS directly.
Spark fails with the error below:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/some path.../data
Hive is able to give me the count for the same query without any error. The table is an external, partitioned table.
I want to ignore the missing files and prevent my Spark job from failing. I searched the internet and tried setting the config parameters below while creating the Spark session, but with no luck.
SparkSession.builder
  .config("spark.sql.hive.verifyPartitionPath", "false")
  .config("spark.sql.files.ignoreMissingFiles", "true")
  .config("spark.sql.files.ignoreCorruptFiles", "true")
  .enableHiveSupport()
  .getOrCreate()
I referred to https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-properties.html for the above config parameters.
val sql = "SELECT count(*) FROM db.table WHERE date=20190710"
val df = spark.sql(sql)
println(df.first.getLong(0)) // note: df.count would return 1 here (the number of result rows), not the table count
I expect the Spark code to complete successfully without a FileNotFoundException even if some of the files listed in the partition metadata are missing.
I'm wondering why spark.sql.files.ignoreMissingFiles has no effect.
The Spark version is 2.2.0.cloudera1. Kindly suggest. Thanks in advance.
Setting the config parameter below resolved the issue:
For Hive:
mapred.input.dir.recursive=true
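For example, in the Hive CLI or Beeline this can be set for the current session before running the query (reusing the query from the question):

```sql
SET mapred.input.dir.recursive=true;
SELECT count(*) FROM db.table WHERE date=20190710;
```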
For Spark Session:
SparkSession.builder
  .config("mapred.input.dir.recursive", "true")
  .enableHiveSupport()
  .getOrCreate()
On further analysis, I found that the location registered for each partition in the table is a parent directory: under it there are many subfolders, and the actual data files sit inside those subfolders. Spark therefore needs recursive input-directory discovery turned on to read the data.
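This layout can be reproduced with plain Scala (no Spark needed) to show why a flat directory listing finds no data files while a recursive walk does; the directory and file names below are made up for illustration:

```scala
import java.nio.file.{Files, Path}

object RecursiveDemo {
  // Builds a temp layout mimicking the table's nested partition directory
  // and returns (files found by a flat listing, files found recursively).
  def run(): (Long, Long) = {
    // Registered partition location: date=20190710_* (hypothetical names)
    val partition: Path = Files.createTempDirectory("date=20190710_")
    val sub = Files.createDirectories(partition.resolve("batch_001"))
    Files.createFile(sub.resolve("part-00000"))

    // Flat listing (the default behaviour): only the subfolder is seen,
    // so no regular files are found at the partition location itself.
    val flat = Files.list(partition).filter(p => Files.isRegularFile(p)).count()

    // Recursive walk (what mapred.input.dir.recursive enables): the data
    // file inside the subfolder is picked up.
    val recursive = Files.walk(partition).filter(p => Files.isRegularFile(p)).count()

    (flat, recursive)
  }

  def main(args: Array[String]): Unit = {
    val (flat, recursive) = run()
    println(s"flat: $flat, recursive: $recursive") // flat: 0, recursive: 1
  }
}
```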