 

How to make Spark session read all the files recursively?

Displaying the directories under which JSON files are stored:

$ tree -d try/
try/
├── 10thOct_logs1
├── 11thOct
│   └── logs2
└── Oct
    └── 12th
        └── logs3

Task is to read all logs using SparkSession.

Is there an elegant way to read through all the files in directories and then sub-directories recursively?

The few glob patterns I tried are prone to unintentionally excluding files.

spark.read.json("file:///var/foo/try/<exp>")

+----------+---+-----+-------+
| <exp> -> | * | */* | */*/* |
+----------+---+-----+-------+
| logs1    | y | y   | n     |
| logs2    | n | y   | y     |
| logs3    | n | n   | y     |
+----------+---+-----+-------+

You can see in the above table that none of the three expressions matches all the directories (located at 3 different depths) at the same time. Frankly speaking, I wasn't expecting the exclusion of 10thOct_logs1 while using the third expression */*/*.

This leads me to conclude that only the files or directory paths matching the expression after the last / are treated as an exact match, and everything else is ignored.
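
One way to see where the exclusion comes from is to resolve each pattern with Hadoop's FileSystem.globStatus, which applies roughly the same glob matching Spark's path resolution uses; Spark then only reads files directly inside the matched paths, without descending further. A small sketch (not part of the original question), assuming the same file:///var/foo/try layout:

import org.apache.hadoop.fs.Path

val base = new Path("file:///var/foo/try")
val fs = base.getFileSystem(spark.sparkContext.hadoopConfiguration)

Seq("*", "*/*", "*/*/*").foreach { pattern =>
  // paths (files or directories) that this glob actually matches
  val matches = fs.globStatus(new Path(base, pattern))
  println(s"$pattern -> ${matches.map(_.getPath.toString).mkString(", ")}")
}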

asked Dec 30 '19 by Saurav Sahu

People also ask

How does Spark read multiple files?

Spark Core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which are used to read single or multiple text or CSV files into a single Spark RDD. Using these methods, we can also read all files from a directory, or files matching a specific pattern.
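
For instance, a minimal Scala sketch (assuming a SparkContext sc and a hypothetical data/ directory of text files):

// one RDD[String] with one record per line, across all matched files
val lines = sc.textFile("data/*.txt")

// one RDD[(String, String)] of (filePath, wholeFileContent) pairs
val whole = sc.wholeTextFiles("data/*.txt")

println(s"line count: ${lines.count()}, file count: ${whole.count()}")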

How can I read all files in a directory using PySpark?

If we have a folder named folder containing .txt files, we can read them all using sc.textFile("folder/*.txt").

How do I list all files recursively?

On Linux, a recursive directory listing can be produced with ls -R; the -R option tells ls to list subdirectories recursively.


1 Answer

Update

A new option, recursiveFileLookup, was introduced in Spark 3 to read files recursively from nested folders:

spark.read.option("recursiveFileLookup", "true").json("file:///var/foo/try")
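
As a related sketch (not from the original answer), this option can be combined with pathGlobFilter, also available since Spark 3, to keep only files with a given extension while still walking every sub-directory:

val jsonLogs = spark.read
  .option("recursiveFileLookup", "true") // descend into all sub-directories
  .option("pathGlobFilter", "*.json")    // keep only files whose name matches
  .json("file:///var/foo/try")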

For older versions, you can alternatively use Hadoop's listFiles to recursively list all file paths and then pass them to Spark's read:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.functions.input_file_name

val conf = sc.hadoopConfiguration

// recursively list all file paths under the base folder
val fromFolder = new Path("file:///var/foo/try/")
val logfiles = fromFolder.getFileSystem(conf).listFiles(fromFolder, true)

var files = Seq[String]()
while (logfiles.hasNext) {
  // one can filter for specific files here
  files = files :+ logfiles.next().getPath().toString
}

// read all collected paths at once
val df = spark.read.csv(files: _*)

// verify which files were actually read
df.select(input_file_name()).distinct().show(false)


+-------------------------------------+
|input_file_name()                    |
+-------------------------------------+
|file:///var/foo/try/11thOct/log2.csv |
|file:///var/foo/try/10thOct_logs1.csv|
|file:///var/foo/try/Oct/12th/log3.csv|
+-------------------------------------+

answered Nov 01 '22 by blackbishop