How to recursively read Hadoop files from directory using Spark?

Question

Inside the given directory I have many different folders and inside each folder I have Hadoop files (part_001, etc.).

directory
   -> folder1
      -> part_001...
      -> part_002...
   -> folder2
      -> part_001...
   ...

Given the directory, how can I recursively read the content of all folders inside this directory and load this content into a single RDD in Spark using Scala?

I found this, but it does not recursively enters into sub-folders (I am using import org.apache.hadoop.mapreduce.lib.input):

  var job: Job = null
  try {
    job = Job.getInstance()
    FileInputFormat.setInputPaths(job, new Path("s3n://" + bucketNameData + "/" + directoryS3))
    FileInputFormat.setInputDirRecursive(job, true)
  } catch {
    case ioe: IOException => ioe.printStackTrace(); System.exit(1);
  }
  val sourceData = sc.newAPIHadoopRDD(job.getConfiguration(), classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).values

I also found this web-page that uses SequenceFile, but again I don't understand how to apply it to my case?

dbustosp · Accepted Answer

If you are using Spark, you can do this using wilcards as follow:

scala>sc.textFile("path/*/*")

sc is the SparkContext which if you are using spark-shell is initialized by default or if you are creating your own program should will have to instance a SparkContext by yourself.

Be careful with the following flag:

scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive") 
> res6: String = null

Yo should set this flag to true:

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")

Paul · Answer

I have found that the parameters must be set in this way:

.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")

How to recursively read Hadoop files from directory using Spark?

Tags:

apache-spark

hadoop

user7379562

2 Answers

dbustosp

Paul

Recent Activity

Donate For Us

How to recursively read Hadoop files from directory using Spark?

Tags:

apache-spark

hadoop

user7379562

2 Answers

dbustosp

Paul

Related questions

Recent Activity

Donate For Us