I want to check whether several files exist in HDFS before loading them with SparkContext.
I use PySpark. I tried:
os.system("hadoop fs -test -e %s" %path)
but as I have a lot of paths to check, the job crashed.
I also tried sc.wholeTextFiles(parent_path) and then filtering by keys, but it crashed too, because parent_path contains a lot of sub-paths and files.
Could you help me?
Hadoop has a command to check whether a file exists or not.
Syntax: hdfs dfs -test -e hdfs_path/filename
Example: hdfs dfs -test -e /revisit/content/files/schema.xml
Then run echo $? to inspect the return code of the previous command: 0 means the path exists, non-zero means it does not.
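A minimal sketch of wrapping that check in a Python function (using subprocess instead of os.system so the return code is easy to read; the path below is just a placeholder):

import subprocess

def hdfs_exists(path):
    # 'hdfs dfs -test -e' exits with 0 if the path exists, non-zero otherwise
    return subprocess.call(["hdfs", "dfs", "-test", "-e", path]) == 0

print(hdfs_exists("/revisit/content/files/schema.xml"))

Note that this still spawns one process per path, so with a very large number of paths it can be as slow as the os.system approach from the question.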
For a file in HDFS, you can use the Hadoop way of doing this, i.e. the hdfs dfs -test command shown above. For PySpark, you can achieve the same thing without invoking a subprocess, using something like the sketch below. That said, some would argue the best way is to call a function which internally performs the traditional Hadoop file check.
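Here is a sketch of the subprocess-free check, assuming an already-created SparkContext named sc; the underscore-prefixed attributes (sc._jvm, sc._jsc) are PySpark's internal handles to the JVM, not part of the public API:

# Grab the Hadoop FileSystem bound to the cluster's configuration
hadoop = sc._jvm.org.apache.hadoop
conf = sc._jsc.hadoopConfiguration()
fs = hadoop.fs.FileSystem.get(conf)

def path_exists(path):
    # FileSystem.exists() talks to HDFS directly, no external process per path
    return fs.exists(hadoop.fs.Path(path))

Because everything stays inside the driver's JVM, checking many paths this way is much cheaper than shelling out for each one.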
You can also use the Hadoop library directly to get the valid paths from HDFS. I assume you have a list of data paths and want to load data only for the paths that exist on HDFS. You can pass your path to the get method in FileSystem, for example:
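A sketch under the assumption that candidate_paths is a Python list of HDFS path strings (the paths shown are placeholders):

hadoop = sc._jvm.org.apache.hadoop
conf = sc._jsc.hadoopConfiguration()

def exists_on_hdfs(path):
    # FileSystem.get() accepts the path's URI, so fully qualified paths
    # pointing at different filesystems are handled as well
    fs = hadoop.fs.FileSystem.get(sc._jvm.java.net.URI(path), conf)
    return fs.exists(hadoop.fs.Path(path))

candidate_paths = ["hdfs:///data/day=01", "hdfs:///data/day=02"]
valid_paths = [p for p in candidate_paths if exists_on_hdfs(p)]

if valid_paths:
    # textFile accepts a comma-separated list of paths, so the surviving
    # paths can be loaded in a single call
    rdd = sc.textFile(",".join(valid_paths))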
Right, as Tristan Reid says:
...(Spark) It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a builtin facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
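For instance, a glob-based read in PySpark could look like this (the pattern below is a placeholder):

# Reads every part file under any matching date directory in one call
rdd = sc.textFile("hdfs:///data/logs/2016-*/part-*")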
Anyway, this is his answer to a related question: Pyspark: get list of files/directories on HDFS path
Once you have the list of files in a directory, it is easy to check whether a particular file exists, for example:
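A sketch along those lines, again via PySpark's internal JVM gateway (the directory and file names are placeholders):

# List the entries directly under a parent directory through the Hadoop API
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
statuses = fs.listStatus(hadoop.fs.Path("/some/parent_path"))
names = [status.getPath().getName() for status in statuses]

# Checking for one particular file is then a plain membership test
file_is_there = "schema.xml" in names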
I hope it can help somehow.