How to check a file/folder is present using pyspark without getting exception

I am trying to check whether a file is present before reading it with PySpark in Databricks, so that I can avoid an exception. I tried the code snippet below, but I get an exception when the file is not present:

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

spark = SparkSession.builder.config(conf=SparkConf()).getOrCreate()
try:
    df = spark.read.format('com.databricks.spark.csv') \
        .option("delimiter", ",") \
        .options(header='true', inferSchema='true') \
        .load('/FileStore/tables/HealthCareSample_dumm.csv')
    print("File Exists")
except IOError:
    print("file not found")

When the file is present, it reads it and prints "File Exists", but when the file is not there it throws: AnalysisException: 'Path does not exist: dbfs:/FileStore/tables/HealthCareSample_dumm.csv;'

Asked Apr 09 '19 by Amareshwar Reddy

People also ask

How to check if a file or folder exists in Python?

To check if a file or folder exists we can use the os.path.exists() function, which accepts the path to the file or directory as an argument. It returns a boolean based on the existence of the path. Note: a path is the unique location of a file or directory in a filesystem.
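For example, a minimal sketch using the standard library (the path here is only an illustration; note this checks the driver's local filesystem, not DBFS or S3 paths as Spark sees them):

import os.path

# True if the file or directory exists on the local filesystem
if os.path.exists('/tmp/HealthCareSample_dumm.csv'):
    print("File Exists")
else:
    print("file not found")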

How to check if the pyspark Dataframe is empty or not?

There are multiple ways to check whether a PySpark DataFrame is empty. The isEmpty function of the DataFrame or Dataset returns true when the DataFrame is empty and false when it is not. Note that if the DataFrame reference itself is null, invoking isEmpty may result in a NullPointerException.
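A minimal sketch, assuming an existing SparkSession named spark (DataFrame.isEmpty() itself requires PySpark 3.3 or later):

df = spark.read.csv('/FileStore/tables/HealthCareSample_dumm.csv', header=True)
# DataFrame.isEmpty() needs PySpark >= 3.3; on older versions use
# df.rdd.isEmpty() or len(df.head(1)) == 0 instead
if df.isEmpty():
    print("DataFrame is empty")
else:
    print("DataFrame has rows")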

How can I check if a file exists using wildcards?

The FileSystem method exists does not support wildcards in the file path. You can instead use globStatus, which supports special pattern-matching characters like *. If it returns a non-empty list, the file exists; otherwise it does not:
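A sketch of that check, assuming a SparkSession named spark (the glob pattern is illustrative):

sc = spark.sparkContext
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
# globStatus returns the FileStatus entries matching the pattern
matches = fs.globStatus(sc._jvm.org.apache.hadoop.fs.Path('/FileStore/tables/*.csv'))
if matches and len(matches) > 0:
    print("at least one matching file exists")
else:
    print("no matching file")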

How do I get the path of a file in spark?

Since you want to store the whole path in a variable, you can achieve this with a combination of dbutils and regular-expression pattern matching. dbutils.fs.ls(path) returns the list of files present in a folder (storage account or DBFS).
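A Databricks-only sketch (dbutils is predefined in Databricks notebooks; it raises an exception for missing paths, so a try/except doubles as an existence check — the message check below is an assumption about how Databricks surfaces the error):

def file_exists(path):
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        # Databricks typically wraps a java.io.FileNotFoundException for missing paths
        if 'java.io.FileNotFoundException' in str(e):
            return False
        raise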


2 Answers

Thanks @Dror and @Kini. I run Spark on a cluster, and I had to add sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]); here s3 is the prefix of your cluster's file system.

def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    # build a FileSystem handle for the bucket named in the path (s3://bucket/...)
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
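Usage might look like this (the bucket and key are placeholders; the function assumes paths of the form s3://bucket/...):

if path_exists("s3://my-bucket/data/HealthCareSample_dumm.csv"):
    print("File Exists")
else:
    print("file not found")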
Answered Jan 04 '23 by rosefun


# uses the cluster's default FileSystem from the Hadoop configuration
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
# returns True if the path exists
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
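Alternatively, a sketch that addresses the original question directly: catch Spark's AnalysisException (rather than IOError) around the read itself, assuming a SparkSession named spark:

from pyspark.sql.utils import AnalysisException

try:
    df = spark.read.csv('/FileStore/tables/HealthCareSample_dumm.csv',
                        header=True, inferSchema=True)
    print("File Exists")
except AnalysisException:
    print("file not found")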
Answered Jan 04 '23 by Prathik Kini