I am trying to check whether a file exists before reading it with PySpark in Databricks, to avoid exceptions. I tried the code snippet below, but I get an exception when the file is not present:
from pyspark.sql import *
from pyspark.conf import SparkConf

SparkSession.builder.config(conf=SparkConf())

try:
    df = sqlContext.read.format('com.databricks.spark.csv').option("delimiter", ",").options(header='true', inferschema='true').load('/FileStore/tables/HealthCareSample_dumm.csv')
    print("File Exists")
except IOError:
    print("file not found")
When the file is present, it reads the file and prints "File Exists", but when the file is not there it throws "AnalysisException: 'Path does not exist: dbfs:/FileStore/tables/HealthCareSample_dumm.csv;'".
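The load fails with AnalysisException rather than IOError, so the except clause never matches. A minimal sketch of catching the right exception, assuming a Databricks notebook where spark is already defined:

from pyspark.sql.utils import AnalysisException

try:
    # Same read as above, expressed with the built-in CSV reader
    df = spark.read.option("header", "true").option("inferSchema", "true").csv('/FileStore/tables/HealthCareSample_dumm.csv')
    print("File Exists")
except AnalysisException:
    # Raised when the path does not exist
    print("file not found")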
To check if a file or folder exists we can use the os.path.exists() function, which accepts the path to the file or directory as an argument and returns a boolean based on whether the path exists. Note: a path is the unique location of a file or directory in a filesystem.
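A minimal sketch of that approach; os.path.exists() only sees the driver's local filesystem, so the /dbfs mount prefix used here is an assumption about how the workspace exposes DBFS:

import os

# Address the DBFS file through the local /dbfs mount point (assumption)
if os.path.exists("/dbfs/FileStore/tables/HealthCareSample_dumm.csv"):
    print("File Exists")
else:
    print("file not found")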
In this article, we are going to check whether a PySpark DataFrame or Dataset is empty or not. There are multiple ways to check: the isEmpty function of the DataFrame or Dataset returns true when the DataFrame is empty and false when it is not. If the DataFrame is empty, invoking isEmpty might result in a NullPointerException.
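A minimal sketch of an emptiness check once the read has succeeded; DataFrame.isEmpty() is available in recent PySpark releases, while len(df.head(1)) == 0 works as a fallback on older ones:

df = spark.read.option("header", "true").csv('/FileStore/tables/HealthCareSample_dumm.csv')

# head(1) returns at most one Row; an empty list means the DataFrame has no rows
if len(df.head(1)) == 0:
    print("DataFrame is empty")
else:
    print("DataFrame has rows")

# On PySpark 3.3+ the same check can be written as df.isEmpty()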
The FileSystem method exists does not support wildcards in the file path when checking existence. You can instead use globStatus, which supports special pattern-matching characters like *. If it returns a non-empty list then the file exists; otherwise it does not:
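A minimal sketch of a wildcard check through the JVM gateway, assuming a SparkSession named spark; the DBFS pattern is only an example:

# Build a Hadoop FileSystem handle from the session's configuration
sc = spark.sparkContext
hadoop = sc._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(sc._jsc.hadoopConfiguration())

# globStatus accepts wildcard patterns such as *, which exists() does not
statuses = fs.globStatus(hadoop.Path("/FileStore/tables/HealthCareSample_*.csv"))
file_exists = statuses is not None and len(statuses) > 0
print("File Exists" if file_exists else "file not found")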
To load data in Spark and add the filename as a DataFrame column, or to store the whole path in a variable, you can achieve this with a combination of dbutils and regular-expression pattern matching. We can use dbutils.fs.ls(path) to return the list of files present in a folder (a storage account or DBFS).
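A minimal existence check built on dbutils, assuming a Databricks notebook where dbutils is available; the helper name is just for illustration:

def file_exists_dbutils(path):
    # dbutils.fs.ls raises an exception when the path does not exist
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False

file_exists_dbutils("/FileStore/tables/HealthCareSample_dumm.csv")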
Thanks @Dror and @Kini. I run Spark on a cluster, and I had to add sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]); here s3 is the prefix of your cluster's file system.
def path_exists(path):
    # spark is a SparkSession
    sc = spark.sparkContext
    # Resolve the file system for the bucket named in the path (e.g. s3://bucket)
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(
        sc._jvm.java.net.URI.create("s3://" + path.split("/")[2]),
        sc._jsc.hadoopConfiguration(),
    )
    # Ask Hadoop whether the exact path exists
    return fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
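For example, the helper can be called with a full S3 URI (the bucket and key here are hypothetical):

path_exists("s3://my-bucket/data/HealthCareSample_dumm.csv")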
# Variant that uses the default file system from the Hadoop configuration
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))