If we have a folder named folder containing only .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if I have a folder named folder containing more folders named by date, like 03, 04, ..., which in turn contain some .log files? How do I read these in Spark?
In my case, the structure is even more nested & complex, so a general answer is preferred.
Spark – Read multiple text files into a single RDD? Spark core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which read single or multiple text or CSV files into a single Spark RDD. These methods can also read all files in a directory, or only the files that match a specific pattern.
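For arbitrarily nested folders on versions before Spark 3.0, one possible approach is to collect the file paths yourself and hand them to textFile(), which accepts a comma-separated list of paths. The sketch below assumes the files live on a local filesystem that the driver can walk with os.walk, that no path contains a comma, and that sc is an already created SparkContext. The helper names find_files and read_nested_text are made up for illustration.

```python
import os


def find_files(root, suffix):
    """Return every file under root, at any depth, whose name ends with suffix."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def read_nested_text(sc, root, suffix=".log"):
    """Read all matching files into one RDD of lines.

    Assumption: sc is an existing SparkContext and the paths are
    reachable from wherever the Spark tasks run (e.g. local mode).
    """
    paths = find_files(root, suffix)
    if not paths:
        raise ValueError("no files ending with %r under %r" % (suffix, root))
    # textFile() takes several inputs as one comma-separated string.
    return sc.textFile(",".join(paths))
```

For data on HDFS or an object store the listing step would need the corresponding filesystem client instead of os.walk.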
Spark 3.0 introduced an improvement for all file-based sources that lets them read from nested directories. You can enable the recursiveFileLookup option at read time, which makes Spark read the files recursively. With the option enabled, Spark picks up files at every depth under the given path, so a count over the result includes the records from files in subdirectories as well.
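Applied to the date-wise .log folders from the question, this might look like the following PySpark sketch. It assumes Spark 3.0 or later and an existing SparkSession passed in as spark. The function name read_logs_recursively is hypothetical, and pathGlobFilter is used here as an additional option to keep only files whose names match the pattern.

```python
def read_logs_recursively(spark, root, pattern="*.log"):
    """Load every file matching pattern under root, at any depth.

    Assumption: spark is an existing SparkSession (Spark 3.0+).
    The result is a DataFrame with one row per line of text.
    """
    return (
        spark.read
        .option("recursiveFileLookup", "true")  # descend into subfolders
        .option("pathGlobFilter", pattern)      # keep only matching file names
        .text(root)
    )
```

A call such as read_logs_recursively(spark, "folder") would then cover folder/03, folder/04, and anything nested below them. Note that, as far as the Spark documentation describes it, recursive lookup turns off partition inference from directory names.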
Assuming you are using Scala, create a parallel collection of your files using the HDFS client and the .par convenience method, then map the result onto spark.read and call an action. If you have enough resources in the cluster, all files will be read in parallel.
If the directory structure is regular, let's say something like this:
folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt
you can use the * wildcard for each level of nesting, as shown below:
>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
[u'file:/folder/a/a/aa.txt', u'file:/folder/a/b/ab.txt', u'file:/folder/b/a/ba.txt', u'file:/folder/b/b/bb.txt']
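When the nesting depth is known but varies between datasets, the pattern can be built rather than typed by hand. This is a small sketch with a hypothetical helper nested_glob; it only assembles the string and assumes forward-slash paths and that every file sits at exactly the given depth.

```python
def nested_glob(root, depth, extension):
    """Build a glob with one '*' per directory level below root.

    depth is the number of folder levels between root and the files.
    """
    parts = [root.rstrip("/")] + ["*"] * depth + ["*." + extension]
    return "/".join(parts)


pattern = nested_glob("/folder", 2, "txt")  # "/folder/*/*/*.txt"
```

The resulting string can be passed to sc.textFile() or sc.wholeTextFiles() in place of the literal pattern. Files at other depths are not matched, so an irregular tree needs a different approach.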
Spark 3.0 provides the recursiveFileLookup option to load files from nested subfolders.
val df = sparkSession.read
  .option("recursiveFileLookup", "true")
  .option("header", "true")
  .csv("src/main/resources/nested")
This recursively loads the files from src/main/resources/nested and its subfolders.