Use Spark to list all files in a Hadoop HDFS directory?

Tags: scala, apache-spark, hadoop

I want to loop through all text files in a Hadoop directory and count all occurrences of the word "error". Is there a way to do the equivalent of hadoop fs -ls /users/ubuntu/, listing all the files in a directory, with the Apache Spark Scala API?

From the given first example, the Spark context seems to access files only individually, through something like:

val file = spark.textFile("hdfs://target_load_file.txt")

In my case, I know neither the number of files in the HDFS folder nor their names beforehand. I looked at the Spark context docs but couldn't find this kind of functionality.

asked Apr 28 '14 22:04

poliu2s

2 Answers

You can use a wildcard:

val errorCount = sc.textFile("hdfs://some-directory/*")
                   .flatMap(_.split(" ")).filter(_ == "error").count
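If a per-file breakdown is wanted in addition to the overall total, one possible sketch uses SparkContext.wholeTextFiles, which yields (path, content) pairs. The helper name wordCountPerFile and the whitespace split are assumptions made for illustration, not part of the answer above. wholeTextFiles reads each file in full, so this suits many small files better than a few very large ones.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: counts occurrences of `word` in each file matched by `dir`.
// Returns (file path, count) pairs.
def wordCountPerFile(sc: SparkContext, dir: String, word: String): RDD[(String, Int)] =
  sc.wholeTextFiles(dir)                           // RDD of (path, whole file content)
    .mapValues(_.split("\\s+").count(_ == word))   // exact-match tokens, split on whitespace

// Usage in spark-shell, where `sc` is predefined:
// wordCountPerFile(sc, "hdfs://some-directory/*", "error").collect().foreach(println)
```

Summing the counts gives the same kind of total as the wildcard version, provided the tokens are split the same way.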

answered Sep 24 '22 20:09

Daniel Darabos

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable

// File system handle built from the Spark context's Hadoop configuration
val fs = FileSystem.get(sc.hadoopConfiguration)
val dirs = mutable.Stack[String]()              // directories still to visit
val files = mutable.ListBuffer.empty[String]    // file paths found so far

dirs.push("/user/username/")

// Depth-first traversal: pop a directory, list it, queue subdirectories, record files
while (dirs.nonEmpty) {
  val status = fs.listStatus(new Path(dirs.pop()))
  status.foreach { x =>
    if (x.isDirectory) dirs.push(x.getPath.toString)
    else files += x.getPath.toString
  }
}
files.foreach(println)
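As a possible alternative to the manual stack, Hadoop's FileSystem.listFiles(path, recursive) walks subdirectories itself and returns a RemoteIterator. The sketch below assumes a Hadoop version that provides listFiles (2.x or later, as far as I know). The helper name listFilesRecursively is hypothetical.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

// Hypothetical helper: returns the paths of all files under `root`,
// including those in subdirectories.
def listFilesRecursively(fs: FileSystem, root: String): List[String] = {
  val it = fs.listFiles(new Path(root), true)   // true = recurse into subdirectories
  val files = ListBuffer.empty[String]
  while (it.hasNext) files += it.next().getPath.toString
  files.toList
}

// Usage in spark-shell, where `sc` is predefined:
// val fs = FileSystem.get(sc.hadoopConfiguration)
// val files = listFilesRecursively(fs, "/user/username/")
// val errorCount = sc.textFile(files.mkString(","))
//   .flatMap(_.split(" ")).filter(_ == "error").count()
```

The usage relies on sc.textFile accepting a comma-separated list of paths, so it will not work for file names that themselves contain commas.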

answered Sep 25 '22 20:09

Animesh Raj Jha
