I am trying to read the last 4 months of data from S3 using PySpark and process it, but I am receiving the following exception:
org.apache.hadoop.mapred.InvalidInputException: Input Pattern s3://path_to_clickstream/date=201508*
On the first day of each month the job fails because there is no entry in the S3 path yet (a separate job processes and uploads the data to that S3 path, and my job runs before it). Is there a way for me to catch this exception and allow the job to continue processing all the paths that do exist?
You can simply trigger a cheap action just after the load and catch the Py4JJavaError:
from py4j.protocol import Py4JJavaError

def try_load(path):
    # Load the path, falling back to an empty RDD if it does not exist.
    rdd = sc.textFile(path)
    try:
        # A cheap action forces evaluation so a missing path fails here.
        rdd.first()
        return rdd
    except Py4JJavaError:
        return sc.emptyRDD()

rdd = try_load(s3_path)
if not rdd.isEmpty():
    run_the_rest_of_your_code(rdd)
Edit:
If you want to handle multiple paths, you can process each one separately and combine the results:
paths = [
    "s3://path_to_inputdir/month1*/",
    "s3://path_to_inputdir/month2*/",
    "s3://path_to_inputdir/month3*/",
]

rdds = sc.union([try_load(path) for path in paths])
If you want better control, you can list the bucket contents and load only the paths that are known to exist.
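A minimal sketch of that idea, assuming boto3 is available on the driver; the bucket name, month prefixes, and the existing_month_paths helper below are placeholders for your own layout, not part of any library:

import boto3

def existing_month_paths(bucket, prefixes):
    # Return a glob path for every month prefix that has at least one object under it.
    s3 = boto3.client("s3")
    paths = []
    for prefix in prefixes:
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
        if resp.get("KeyCount", 0) > 0:
            paths.append("s3://{}/{}*".format(bucket, prefix))
    return paths

paths = existing_month_paths("my-clickstream-bucket",
                             ["path_to_clickstream/date=201508",
                              "path_to_clickstream/date=201509"])
if paths:
    # textFile accepts a comma-separated list of paths.
    rdd = sc.textFile(",".join(paths))

This way Spark only ever sees patterns that match at least one object, so the InvalidInputException should not come up in the first place.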
If at least one of these paths is non-empty, you should be able to make things even simpler and use a glob like this:
sc.textFile("s3://path_to_inputdir/month[1-3]*/")