 

How to ignore non-existent paths in PySpark

I am looking for a way to read a bunch of files from S3, but some of the paths may not exist. I would like to simply ignore a non-existent path and process all the information that is available. For example, I want to read in these files:

files_to_read = []
for id in ids_to_process:
    for date in dates_to_process:
        files_to_read.append('s3://bucket/date=' + date + '/id=' + id + '/*.parquet')

df = sqlContext.read.parquet(*files_to_read)

The issue is that some ids may not have started until a certain date, and while I can figure that out, it is not easy to do programmatically. What would be the easiest way to either a) ignore a file if its path does not exist, or b) check whether a path exists?

I have tried sqlContext.sql("spark.sql.files.ignoreMissingFiles=true"), which does not seem to work. Is there a similar option that I am missing?
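(A sketch of how that option would normally be set, assuming a standard Spark 2.x+ SQLContext; note the SET keyword, which my attempt above omits:)

# As a SQL SET statement:
sqlContext.sql("SET spark.sql.files.ignoreMissingFiles=true")
# Or via the SQLContext conf API:
sqlContext.setConf("spark.sql.files.ignoreMissingFiles", "true")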

Asked by Eumcoz

1 Answer

Here, a "missing file" really means a file that is deleted from the directory after the DataFrame has been constructed. spark.sql.files.ignoreMissingFiles does not cover paths that never existed in the first place, which is why setting it has no effect in your case.

It is better to check in Python beforehand whether each target path exists, instead of handing that over to Spark.
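A minimal sketch of that pre-check, filtering the paths from your question through Hadoop's FileSystem API. The sc._jsc and sc._jvm handles are PySpark internals rather than public API, so treat this as an assumption; ids_to_process and dates_to_process come from the question:

# Keep only the S3 prefixes that actually exist, then read the survivors.
sc = sqlContext.sparkSession.sparkContext  # assuming a Spark 2.x+ SQLContext
hadoop_conf = sc._jsc.hadoopConfiguration()

def path_exists(path):
    # Resolve the FileSystem for this URI and test the prefix. exists()
    # does not expand globs, so check the directory, not '/*.parquet'.
    jpath = sc._jvm.org.apache.hadoop.fs.Path(path)
    return jpath.getFileSystem(hadoop_conf).exists(jpath)

files_to_read = []
for id in ids_to_process:
    for date in dates_to_process:
        base = 's3://bucket/date=' + date + '/id=' + id
        if path_exists(base):
            files_to_read.append(base + '/*.parquet')

if files_to_read:  # read.parquet fails on an empty argument list
    df = sqlContext.read.parquet(*files_to_read)

An alternative would be to list the keys with boto3 (e.g. list_objects_v2) and build the path list from that, which avoids going through the JVM gateway entirely.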

Answered by 过过招