
HDFS File Existence Check in PySpark

Can anyone suggest the best way to check file existence in PySpark?

Currently I am using the method below to check; please advise.

def path_exist(path):
    try:
        # Try to load the path as ORC and materialize one row;
        # any failure (missing path, wrong format) counts as "not found"
        rdd = sparkSqlCtx.read.format("orc").load(path)
        rdd.take(1)
        return True
    except Exception:
        return False
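For context, a minimal sketch of how this helper might be driven, assuming sparkSqlCtx is a SparkSession created elsewhere (the question does not show its setup) and using a hypothetical path:

from pyspark.sql import SparkSession

# Assumed setup; not shown in the question
sparkSqlCtx = SparkSession.builder.appName("existence-check").getOrCreate()

# Hypothetical ORC path, for illustration only
print(path_exist("/data/events/2022-12-01"))

Note that this approach actually schedules a read job and assumes the path holds ORC data, which is presumably why a lighter-weight check is wanted.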
asked Dec 02 '22 by Mohammad Umar Farooq

1 Answer

You can use the Hadoop Java API org.apache.hadoop.fs.{FileSystem, Path} through Py4J.

# Access the JVM gateway and JavaSparkContext exposed by the SparkSession
jvm = spark_session._jvm
jsc = spark_session._jsc
# Get a FileSystem handle for the cluster's Hadoop configuration
fs = jvm.org.apache.hadoop.fs.FileSystem.get(jsc.hadoopConfiguration())
if fs.exists(jvm.org.apache.hadoop.fs.Path("/foo/bar")):
    print("/foo/bar exists")
else:
    print("/foo/bar does not exist")
answered Jan 03 '23 by emeth