I want to recursively read all csv files in a given folder into a Spark SQL DataFrame
using a single path, if possible.
My folder structure looks something like this and I want to include all of the files with one path:
resources/first.csv
resources/subfolder/second.csv
resources/subfolder/third.csv
This is my code:
def read: DataFrame =
sparkSession
.read
.option("header", "true")
.option("inferSchema", "true")
.option("charset", "UTF-8")
.csv(path)
Setting path to .../resources/*/*.csv omits first.csv, while .../resources/*.csv omits second.csv and third.csv.
I know csv() also takes multiple strings as path arguments, but I want to avoid that if possible.
Note: I know my question is similar to "How to import multiple csv files in a single load?", except that I want to include files from all contained folders, independent of their location within the main folder.
If there are only csv files and only one level of subfolders in your resources directory, then you can use resources/**.
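For example, plugging that glob into the reader from the question would look roughly like this (a minimal sketch, reusing the sparkSession and options shown above):

val df: DataFrame = sparkSession
  .read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("charset", "UTF-8")
  .csv("resources/**")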
EDIT
Otherwise you can use the Hadoop FileSystem class to recursively list every csv file in your resources directory and then pass the list to .csv():
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

val fs = FileSystem.get(new Configuration())
// listFiles with recursive = true walks resources/ and all of its subfolders
val files = fs.listFiles(new Path("resources/"), true)

val filePaths = new ListBuffer[String]
while (files.hasNext()) {
  val file = files.next()
  filePaths += file.getPath.toString
}

val df: DataFrame = spark
  .read
  .options(...)
  .csv(filePaths: _*)
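A couple of practical notes: on a cluster you may prefer FileSystem.get(spark.sparkContext.hadoopConfiguration) over a fresh Configuration, so the listing hits the same file system (HDFS, S3, local) as the read itself, and if the folder can also contain non-csv files you can filter filePaths on the ".csv" extension before passing it to csv().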