
Spark Scala list folders in directory

I want to list all folders within an HDFS directory using Scala/Spark. In Hadoop I can do this with the command:

hadoop fs -ls hdfs://sandbox.hortonworks.com/demo/

I tried it with:

val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com/"), conf)
val path = new Path("hdfs://sandbox.hortonworks.com/demo/")
val files = fs.listFiles(path, false)

But it does not seem to look in the HDFS directory, as I cannot find my folders/files.

I also tried with:

FileSystem.get(sc.hadoopConfiguration).listFiles(new Path("hdfs://sandbox.hortonworks.com/demo/"), true) 

But this also does not help.

Do you have any other idea?

PS: I also checked this thread: Spark iterate HDFS directory, but it does not work for me. It does not seem to search the HDFS directory, only the local file system with the scheme file://.

AlexL asked Oct 28 '15 15:10




2 Answers

We are using Hadoop 1.4, which doesn't have the listFiles method, so we use listStatus to get the directories. listStatus has no recursive option, but a recursive lookup is easy to write by hand.

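import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// listStatus returns both the files and the directories directly under the path
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
status.foreach(x => println(x.getPath))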
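The recursive lookup mentioned above could be sketched roughly as follows. This is only an illustration, not code from the original setup: the object name HdfsDirs and the method listDirsRecursively are made up for this example, and it assumes a Hadoop version where FileStatus.isDirectory is available (on older releases isDir is the counterpart).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper, named for this example only.
object HdfsDirs {
  // Collect every directory below `root`, using nothing but listStatus.
  def listDirsRecursively(fs: FileSystem, root: Path): Seq[Path] = {
    // Directories directly under `root`; plain files are filtered out.
    val children = fs.listStatus(root).toSeq.filter(_.isDirectory).map(_.getPath)
    // Descend into each child directory and append what is found there.
    children ++ children.flatMap(dir => listDirsRecursively(fs, dir))
  }
}
```

One possible way to call it, assuming the path from the question: take the file system from the path itself with new Path("hdfs://sandbox.hortonworks.com/demo/").getFileSystem(new Configuration()), so that the hdfs:// scheme in the URI decides which file system is used, then pass that file system and the same path to HdfsDirs.listDirsRecursively and print the result.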
nil answered Sep 28 '22 18:09


In Spark 2.0+,

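import org.apache.hadoop.fs.{FileSystem, Path}

// hdfsPath is a placeholder for the HDFS directory you want to list
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// isDirectory replaces the deprecated isDir
fs.listStatus(new Path(hdfsPath)).filter(_.isDirectory).map(_.getPath).foreach(println)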

Hope this is helpful.
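As a side note on why the listFiles attempts in the question show no folders: listFiles returns a RemoteIterator over files only, whereas listStatus returns entries for both files and directories. A minimal sketch of the difference, where the object ListingKinds and its method names are invented for this example:

```scala
import org.apache.hadoop.fs.{FileSystem, Path, RemoteIterator}

// Hypothetical helpers, named for this example only.
object ListingKinds {
  // Copy a Hadoop RemoteIterator into an ordinary Scala List.
  def drain[T](it: RemoteIterator[T]): List[T] = {
    val out = List.newBuilder[T]
    while (it.hasNext) out += it.next()
    out.result()
  }

  // Names of the plain files directly under `dir`; listFiles never yields directories.
  def fileNames(fs: FileSystem, dir: Path): List[String] =
    drain(fs.listFiles(dir, false)).map(_.getPath.getName)

  // Names of the directories directly under `dir`, taken from listStatus.
  def dirNames(fs: FileSystem, dir: Path): List[String] =
    fs.listStatus(dir).toList.filter(_.isDirectory).map(_.getPath.getName)
}
```

So for a directory that contains only sub-folders, fileNames would come back empty while dirNames lists them, which matches what the question describes.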

Ajay Ahuja answered Sep 28 '22 17:09