PySpark: how to check if a file exists in HDFS

I want to check whether several files exist in HDFS before loading them with SparkContext. I use PySpark. I tried os.system("hadoop fs -test -e %s" % path), but since I have a lot of paths to check, the job crashed. I also tried sc.wholeTextFiles(parent_path) and then filtering by key, but that crashed too because parent_path contains a lot of sub-paths and files. Could you help me?

asked Sep 01 '15 by A7med

People also ask

How do I check if a file exists in Hadoop?

Use Hadoop's test command. Syntax: hdfs dfs -test -e hdfs_path/filename. Example: hdfs dfs -test -e /revisit/content/files/schema.xml. Then run echo $? to check the previous command's return code: 0 means the file exists.
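If you need to run that check from Python, a minimal sketch (my own, not from the thread) that wraps the standard HDFS CLI with subprocess instead of os.system:

    import subprocess

    # True when `hdfs dfs -test -e` exits with code 0, i.e. the path exists.
    # Each call spawns a new JVM, so this is slow when checking many paths;
    # the in-process approach below avoids that cost.
    def hdfs_path_exists(path):
        return subprocess.call(["hdfs", "dfs", "-test", "-e", path]) == 0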

How to check for file presence in PySpark without a subprocess?

In PySpark you can do this without invoking a subprocess: go through the SparkContext's JVM gateway to the Hadoop FileSystem API, and wrap the check in a small function that internally performs the traditional Hadoop file check, so it is easy to reuse.
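A sketch of that approach, assuming sc is a live SparkContext (the _jvm and _jsc attributes are Spark-internal handles to the JVM gateway):

    # Check one HDFS path in-process via the Hadoop FileSystem API.
    def path_exists(sc, path):
        jvm = sc._jvm
        conf = sc._jsc.hadoopConfiguration()
        fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)
        return fs.exists(jvm.org.apache.hadoop.fs.Path(path))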

How to get valid paths from HDFS?

You can also use the Hadoop FileSystem library to get the valid paths from HDFS. Assuming you have a list of data paths and want to load data only for the paths that exist on HDFS, obtain a FileSystem handle with FileSystem.get and test each path with its exists method.
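For example, a sketch under the same assumptions (candidate_paths is a hypothetical list of HDFS path strings; sc.textFile accepts a comma-separated list of paths):

    # Keep only the candidate paths that actually exist on HDFS.
    def existing_paths(sc, candidate_paths):
        jvm = sc._jvm
        fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
        return [p for p in candidate_paths
                if fs.exists(jvm.org.apache.hadoop.fs.Path(p))]

    valid = existing_paths(sc, candidate_paths)
    if valid:
        rdd = sc.textFile(",".join(valid))  # load all existing paths at once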


1 Answer

As Tristan Reid puts it:

...(Spark) It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a builtin facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.

Anyway, this is his answer to a related question: Pyspark: get list of files/directories on HDFS path

Once you have the list of files in a directory, it is easy to check whether a particular file exists.
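For instance, a sketch along those lines (parent_path and the file name are placeholders; sc is a live SparkContext):

    # List the entry names under an HDFS directory, then test membership.
    def list_dir(sc, parent_path):
        jvm = sc._jvm
        fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
        statuses = fs.listStatus(jvm.org.apache.hadoop.fs.Path(parent_path))
        return [status.getPath().getName() for status in statuses]

    print("part-00000" in list_dir(sc, parent_path))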

I hope this helps.

answered Nov 15 '22 by Josemy