 

Read all files in a nested folder in Spark


If we have a folder named folder containing only .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if folder contains further folders named by date, like 03, 04, ..., which in turn contain .log files? How do I read these in Spark?

In my case, the structure is even more nested & complex, so a general answer is preferred.

asked Aug 26 '15 by kamalbanga

People also ask

How can I read all files in a directory using PySpark?

If we have a folder named folder containing .txt files, we can read them all using sc.textFile("folder/*.txt").

How do I read multiple files in Spark?

Spark – Read multiple text files into a single RDD? Spark core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which are used to read one or more text or CSV files into a single Spark RDD. Using these methods we can also read all files from a directory, or only files matching a specific pattern.
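As a quick, hedged illustration of the two methods (assuming an active SparkContext named sc and a hypothetical folder directory):

>>> lines = sc.textFile("folder/*.txt")        # RDD of individual lines from all matching files
>>> files = sc.wholeTextFiles("folder/*.txt")  # RDD of (path, whole-file-content) pairs
>>> lines.count(), files.keys().collect()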

How do I read recursively in Spark?

Spark 3.0 introduced an improvement for all file-based sources to read from nested directories. You can enable the recursiveFileLookup option at read time, which makes Spark read files recursively, so the resulting data (and any counts on it) covers every file found in the subfolders.

How do I read multiple files in Spark which are present in an HDFS cluster?

Assuming you are using Scala, create a parallel collection of your files using the HDFS client and the .par convenience method, then map the result onto spark.read and call an action -- voilà, if you have enough resources in the cluster, you'll have all files being read in parallel.
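The same idea can be sketched in PySpark by submitting the per-file reads from a thread pool instead of a Scala parallel collection. This is a rough adaptation rather than the original answer's code, and it assumes a SparkSession named spark plus hypothetical HDFS paths:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical HDFS paths; in practice you would list them with an HDFS client.
paths = ["hdfs:///logs/03/a.log", "hdfs:///logs/04/b.log"]

def count_lines(path):
    # Each read plus action is submitted as its own Spark job; jobs from
    # different threads can run concurrently if the cluster has capacity.
    return path, spark.read.text(path).count()

with ThreadPoolExecutor(max_workers=len(paths)) as pool:
    for path, n in pool.map(count_lines, paths):
        print(path, n)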


2 Answers

If the directory structure is regular, let's say something like this:

folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt

you can use a * wildcard for each level of nesting, as shown below:

>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
[u'file:/folder/a/a/aa.txt',
 u'file:/folder/a/b/ab.txt',
 u'file:/folder/b/a/ba.txt',
 u'file:/folder/b/b/bb.txt']
answered Sep 20 '22 by zero323
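Applying the same wildcard idea to the layout from the question (date-named folders holding .log files) might look like the sketch below; the folder path and the two-level depth are assumptions, not part of the answer above:

>>> logs = sc.textFile("folder/*/*.log")   # one * per nesting level: folder/<date>/<file>.log
>>> logs.count()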


Spark 3.0 provides the option recursiveFileLookup to load files recursively from subfolders.

val df = sparkSession.read
  .option("recursiveFileLookup", "true")
  .option("header", "true")
  .csv("src/main/resources/nested")

This recursively loads the files from src/main/resources/nested and its subfolders.

answered Sep 23 '22 by Kumar
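For readers on PySpark, a hedged equivalent of the Scala snippet above might look like this; it assumes a Spark 3.0+ SparkSession named spark and reuses the same example path:

# Read all CSV files under the directory and its subfolders recursively.
df = (spark.read
      .option("recursiveFileLookup", "true")
      .option("header", "true")
      .csv("src/main/resources/nested"))
df.show()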