I want to recursively read all csv files in a given folder into a Spark SQL DataFrame using a single path, if possible.
My folder structure looks something like this and I want to include all of the files with one path:
1. resources/first.csv
2. resources/subfolder/second.csv
3. resources/subfolder/third.csv

This is my code:
def read: DataFrame =
      sparkSession
        .read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("charset", "UTF-8")
        .csv(path)
Setting path to .../resources/*/*.csv omits file 1, while .../resources/*.csv omits files 2 and 3.
I know csv() also takes multiple strings as path arguments, but I want to avoid that if possible.
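For reference, the variant I want to avoid would look roughly like this, with every file spelled out explicitly (paths taken from the listing above):

    sparkSession
      .read
      .option("header", "true")
      .csv("resources/first.csv", "resources/subfolder/second.csv", "resources/subfolder/third.csv")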
Note: I know my question is similar to How to import multiple csv files in a single load?, except that I want to include the files of all contained subfolders, regardless of their location within the main folder.
If there are only csv files and only one level of subfolders in your resources directory, then you can use resources/**.
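With the options from the question, that would look roughly like the sketch below. Keep in mind the caveat above: this glob only covers the top-level files and one layer of subfolders, not deeper nesting.

    val df = sparkSession
      .read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("charset", "UTF-8")
      .csv("resources/**")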
EDIT
Otherwise, you can use the Hadoop FileSystem class to recursively list every csv file in your resources directory and then pass the list to .csv():
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.collection.mutable.ListBuffer

    val fs = FileSystem.get(new Configuration())
    // the second argument makes listFiles walk every subfolder of resources/
    val files = fs.listFiles(new Path("resources/"), true)
    val filePaths = new ListBuffer[String]
    while (files.hasNext()) {
        val file = files.next()
        // keep only the csv files
        if (file.getPath.getName.endsWith(".csv"))
            filePaths += file.getPath.toString
    }

    val df: DataFrame = spark
        .read
        .options(...)
        .csv(filePaths: _*)
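With this approach Spark only ever sees concrete file paths, so it no longer matters how deeply the csv files are nested under resources/, and the filter on the file name keeps any non-csv files out of the DataFrame.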