 

How to ignore non-existent paths in PySpark

I am looking for a way to read a bunch of files from S3, but some of the paths may not exist. I would like to simply ignore a non-existent path and process all the information that is available. For example, I want to read in these files:

files_to_read = []
for id in ids_to_process:
    for date in dates_to_process:
        files_to_read.append('s3://bucket/date=' + date + '/id=' + id + '/*.parquet')

df = sqlContext.read.parquet(*files_to_read)

The issue is that some ids may not have started until a certain date, and while I can figure that out, it is not easy to do programmatically. What would be the easiest way to either a) ignore a file if its path does not exist, or b) check whether a path exists?

I have tried sqlContext.sql("spark.sql.files.ignoreMissingFiles=true"), which does not seem to work. Is there a similar option that I am missing?
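(A sketch of how that option would normally be set, assuming a standard Spark 2.x+ SQLContext; note the SET keyword, which my attempt above omits:)

# As a SQL SET statement:
sqlContext.sql("SET spark.sql.files.ignoreMissingFiles=true")
# Or via the SQLContext conf API:
sqlContext.setConf("spark.sql.files.ignoreMissingFiles", "true")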

Asked by Eumcoz

1 Answer

Here, a "missing file" really means a file that is deleted from the directory after the DataFrame has been constructed. spark.sql.files.ignoreMissingFiles does not cover paths that never existed in the first place, which is why setting it has no effect in your case.

It is better to check in Python beforehand whether each target path exists, instead of handing that over to Spark.
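A minimal sketch of that pre-check, filtering the paths from your question through Hadoop's FileSystem API. The sc._jsc and sc._jvm handles are PySpark internals rather than public API, so treat this as an assumption; ids_to_process and dates_to_process come from the question:

# Keep only the S3 prefixes that actually exist, then read the survivors.
sc = sqlContext.sparkSession.sparkContext  # assuming a Spark 2.x+ SQLContext
hadoop_conf = sc._jsc.hadoopConfiguration()

def path_exists(path):
    # Resolve the FileSystem for this URI and test the prefix. exists()
    # does not expand globs, so check the directory, not '/*.parquet'.
    jpath = sc._jvm.org.apache.hadoop.fs.Path(path)
    return jpath.getFileSystem(hadoop_conf).exists(jpath)

files_to_read = []
for id in ids_to_process:
    for date in dates_to_process:
        base = 's3://bucket/date=' + date + '/id=' + id
        if path_exists(base):
            files_to_read.append(base + '/*.parquet')

if files_to_read:  # read.parquet fails on an empty argument list
    df = sqlContext.read.parquet(*files_to_read)

An alternative would be to list the keys with boto3 (e.g. list_objects_v2) and build the path list from that, which avoids going through the JVM gateway entirely.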

Answered by 过过招