I am looking for a way to read a bunch of files from S3, but some of the paths may not exist. I would like to simply ignore any path that does not exist and process everything that is available. For example, I want to read in these files:
files_to_read = []
for id in ids_to_process:
    for date in dates_to_process:
        files_to_read.append('s3://bucket/date=' + date + '/id=' + id + '/*.parquet')
sqlContext.read.parquet(*files_to_read)
The issue is that some ids may not have started until a certain date, and while I can figure that out, it is not easy to do programmatically. What would be the easiest way to either a) ignore a file if the path does not exist, or b) check whether a path exists?
I have tried sqlContext.sql("spark.sql.files.ignoreMissingFiles=true"), which does not seem to work. Is there a similar option that I am missing?
Here, a "missing file" really means a file that was deleted from the directory after you constructed the DataFrame, so ignoreMissingFiles does not help with paths that never existed in the first place.
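As an aside, session options are normally set through setConf (or a SQL SET statement) rather than by passing the bare assignment string to sqlContext.sql. A minimal sketch of the usual form, for the cases where the option does apply:

# Set the option on the session config rather than as a bare SQL string.
# Note: this only skips files deleted after the DataFrame is constructed;
# it does not suppress errors for glob paths that never matched anything.
sqlContext.setConf("spark.sql.files.ignoreMissingFiles", "true")

# Equivalent SQL form:
# sqlContext.sql("SET spark.sql.files.ignoreMissingFiles=true")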
It is recommended to check in Python beforehand whether each target path exists, instead of handing the problem over to Spark.
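A minimal sketch of that pre-check, assuming boto3 is available with credentials configured; the bucket name and key layout here just mirror the question's example paths:

import boto3

s3 = boto3.client('s3')

files_to_read = []
for id in ids_to_process:
    for date in dates_to_process:
        prefix = 'date=' + date + '/id=' + id + '/'
        # list_objects_v2 only includes 'Contents' in its response when at
        # least one key matches the prefix, so its absence means the
        # partition path does not exist.
        resp = s3.list_objects_v2(Bucket='bucket', Prefix=prefix, MaxKeys=1)
        if 'Contents' in resp:
            files_to_read.append('s3://bucket/' + prefix + '*.parquet')

# Only call read.parquet if at least one path survived the filter,
# since Spark raises an error on an empty path list.
if files_to_read:
    df = sqlContext.read.parquet(*files_to_read)

Listing with MaxKeys=1 keeps each existence check to a single cheap request per partition, which matters when the id/date grid is large.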