
Pyspark: get list of files/directories on HDFS path

As per the title. I'm aware of textFile, but, as the name suggests, it works only on text files. I need to list the files/directories inside a given path, on either HDFS or the local filesystem. I'm using PySpark.

asked Mar 02 '16 by Federico Ponzi

People also ask

How do you list the files in an HDFS directory?

Use the hdfs dfs -ls command to list the files in an HDFS directory: run hdfs dfs -ls with the directory location as its argument.

How do I view folders in HDFS?

If you type hdfs dfs -ls / you will get a list of the directories at the root of HDFS.

How do I list all files in HDFS along with their sizes?

You can use the hadoop fs -ls command to list the files in the current directory along with their details. The 5th column of the output contains the file size in bytes (a scripted version is sketched after this list).
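As a hedged aside (not part of the answers above): the same CLI listing can be driven from Python by shelling out to the hdfs binary, assuming it is on the PATH; /some_dir below is a placeholder path.

import subprocess

# Run `hdfs dfs -ls` and capture its output (assumes the hdfs CLI is on PATH)
result = subprocess.run(
    ["hdfs", "dfs", "-ls", "/some_dir"],
    capture_output=True, text=True, check=True,
)

# Each entry line has the form:
# permissions  replication  owner  group  size  date  time  path
for line in result.stdout.splitlines():
    parts = line.split()
    if len(parts) >= 8:  # skips the leading "Found N items" summary line
        size, path = parts[4], parts[7]
        print(path, size)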


1 Answer

Using the JVM gateway may not be the most elegant approach, but in some cases the code below can be helpful:

# Reach the Hadoop FileSystem API through Py4J's JVM gateway
URI           = sc._gateway.jvm.java.net.URI
Path          = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem    = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

# Connect to the namenode
fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())

# List every entry (files and directories) under the given path
status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))

for fileStatus in status:
    print(fileStatus.getPath())
answered Sep 17 '22 by volhv
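Since the question also covers local paths: a minimal follow-up sketch (my addition, not part of the answer above), assuming an active SparkContext sc. It reuses Spark's own Hadoop configuration via sc._jsc.hadoopConfiguration() instead of a hard-coded namenode URI, and reads sizes with getLen():

Path       = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem

# Default filesystem from Spark's configuration: HDFS on a cluster,
# the local filesystem in local mode
fs = FileSystem.get(sc._jsc.hadoopConfiguration())

for status in fs.listStatus(Path('/some_dir/')):
    # getLen() is the size in bytes; isDirectory() tells dirs from files
    print(status.getPath(), status.getLen(), status.isDirectory())

To target an explicit local directory regardless of the default filesystem, build the Path with a file:/// URI and obtain its filesystem via path.getFileSystem(sc._jsc.hadoopConfiguration()).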