Read all files in a nested folder in Spark


If we have a folder named folder that contains only .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if folder contains further subfolders named by date, like 03, 04, ..., which in turn contain some .log files? How do I read these in Spark?

In my case, the structure is even more nested & complex, so a general answer is preferred.


asked Aug 26 '15 18:08

kamalbanga

2 Answers

If the directory structure is regular, let's say something like this:

folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt

you can use a * wildcard for each level of nesting, as shown below:

>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
[u'file:/folder/a/a/aa.txt',
 u'file:/folder/a/b/ab.txt',
 u'file:/folder/b/a/ba.txt',
 u'file:/folder/b/b/bb.txt']
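If the nesting depth varies between datasets but is uniform within each one, the pattern can be generated instead of typed by hand. The following is a minimal sketch: nested_glob is a hypothetical helper (not part of Spark), and the check uses Python's stdlib glob on a temporary copy of the tree above, which only approximates how Hadoop expands the pattern for Spark.

```python
import glob
import os
import tempfile


def nested_glob(root, depth, extension):
    """Build a pattern with one '*' per directory level below root.

    Hypothetical helper: depth=2 and extension='txt' gives root/*/*/*.txt.
    """
    return "/".join([root.rstrip("/")] + ["*"] * depth + ["*." + extension])


# Recreate the example tree locally and try the pattern with the stdlib glob.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("a/a/aa.txt", "a/b/ab.txt", "b/a/ba.txt", "b/b/bb.txt"):
        path = os.path.join(tmp, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    pattern = nested_glob(tmp, 2, "txt")
    matched = sorted(os.path.relpath(p, tmp) for p in glob.glob(pattern))
    print(matched)
```

With a SparkContext available, the same string could be passed on directly, e.g. sc.wholeTextFiles(nested_glob("/folder", 2, "txt")).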

answered Sep 20 '22 07:09

zero323

Spark 3.0 provides an option recursiveFileLookup to load files from recursive subfolders.

val df = sparkSession.read
  .option("recursiveFileLookup", "true")
  .option("header", "true")
  .csv("src/main/resources/nested")

This recursively loads the files from src/main/resources/nested and its subfolders.
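For the .log files from the question, a rough PySpark counterpart might look like the sketch below. It assumes Spark 3.0 or later and an already created SparkSession passed in as spark; read_nested_logs is a hypothetical wrapper name, and pathGlobFilter is used here on the assumption that only files ending in .log should be picked up.

```python
def read_nested_logs(spark, root, pattern="*.log"):
    """Read every file under root, at any depth, whose name matches pattern.

    Assumes `spark` is an existing SparkSession (Spark 3.0+).
    Returns a DataFrame with one row per line of text.
    """
    return (
        spark.read
        .option("recursiveFileLookup", "true")  # descend into all subfolders
        .option("pathGlobFilter", pattern)      # keep only matching file names
        .text(root)                             # plain-text source, column "value"
    )
```

A call such as read_nested_logs(spark, "folder") would then cover folder/03, folder/04 and anything nested below them. Note that, as far as I know, recursiveFileLookup turns off partition discovery, so directory names like 03 do not become columns.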

answered Sep 23 '22 07:09

Kumar
