
How to load and process multiple csv files from a DBFS directory with Spark

I want to run the following code on each file that I read from DBFS (Databricks FileSystem). I tested it on all files that are in a folder, but I want to make similar calculations for each file in the folder, one by one:

// a-e are calculated fields
val df2=Seq(("total",a,b,c,d,e)).toDF("file","total","count1","count2","count3","count4")

//schema is now an empty dataframe
val final1 = schema.union(df2)

Is that possible? I guess reading it from dbfs should be done differently as well, from what I do now:

val df1 = spark
      .read
      .format("csv")
      .option("header", "true")
      .option("delimiter",",")
      .option("inferSchema", "true")
      .load("dbfs:/Reports/*.csv")
      .select("lot of ids")

Thank you a lot in advance for the ideas :)

Eve Avatar asked Oct 19 '25 09:10



1 Answer

As discussed, you have three options here.

In my example I am using the following three datasets:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|1   |100 |200 |
|2   |300 |400 |
+----+----+----+

+----+----+----+
|col1|col2|col3|
+----+----+----+
|3   |60  |80  |
|4   |12  |100 |
|5   |20  |10  |
+----+----+----+

+----+----+----+
|col1|col2|col3|
+----+----+----+
|7   |20  |40  |
|8   |30  |40  |
+----+----+----+

First, create your schema (it is faster to define the schema explicitly than to infer it):

import org.apache.spark.sql.types._

val df_schema =
  StructType(
    List(
        StructField("col1", IntegerType, true),
        StructField("col2", IntegerType, true),
        StructField("col3", IntegerType, true)))

Option 1:

Load all CSVs at once with:

val df1 = spark
      .read
      .option("header", "false")
      .option("delimiter", ",")
      .option("inferSchema", "false")
      .schema(df_schema)
      .csv("file:///C:/data/*.csv")

Then apply your logic to the whole dataset, grouping by the file name.

Precondition: you must find a way to attach the file name to each row (Spark's input_file_name() function can do this).
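A minimal sketch of this approach, assuming df1 is the DataFrame loaded above and that your per-file calculations are simple aggregates (the count/sum expressions below are placeholders for the real logic from the question):

import org.apache.spark.sql.functions.{input_file_name, count, sum}

// Tag every row with the file it came from, then aggregate per file.
// The aggregate columns are placeholders for your actual calculations.
val perFile = df1
  .withColumn("file", input_file_name())
  .groupBy("file")
  .agg(
    count("*").as("total"),
    sum("col2").as("count1"),
    sum("col3").as("count2"))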

Option 2:

List the CSV files in the directory, then iterate over them and create a DataFrame for each CSV. Inside the loop, apply your logic to each CSV. At the end of each iteration, append (union) the result to a second DataFrame that accumulates your results.

Attention: Please be aware that a large number of files can produce a very big DAG and consequently a huge execution plan. To avoid this, you can persist the intermediate results or call collect. In the example below, persist or collect is assumed to run every bufferSize iterations. You can adjust or even remove this logic according to the number of CSV files.

This is a sample code for the 2nd option:

import java.io.File
import org.apache.spark.sql.Row
import spark.implicits._

val dir = "C:\\data_csv\\"
val csvFiles = new File(dir).listFiles.filter(_.getName.endsWith(".csv"))

val bufferSize = 10
var indx = 0
//create an empty df which will hold the accumulated results
var bigDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], df_schema)
csvFiles.foreach{ path => 
    var tmp_df = spark
                  .read
                  .option("header", "false")
                  .option("delimiter", ",")
                  .option("inferSchema", "false")
                  .schema(df_schema)
                  .csv(path.getPath)

    //execute your custom logic/calculations with tmp_df

    if((indx + 1) % bufferSize == 0){
        // If the buffer size is reached then
        // 1. call bigDf.persist() or bigDf.collect()
        // 2. if you use collect(), load the results back into bigDf
    }

    bigDf = bigDf.union(tmp_df)
    indx = indx + 1
}
bigDf.show(false)
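One way to fill in that buffer step is to truncate the lineage with localCheckpoint (available since Spark 2.3), which eagerly materializes the DataFrame so the plan does not keep growing with every union. A sketch of the body of the if block:

if ((indx + 1) % bufferSize == 0) {
  // Materialize the accumulated rows and cut the lineage so the
  // execution plan stays small across many unions.
  bigDf = bigDf.localCheckpoint()
}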

This should output:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|1   |100 |200 |
|2   |300 |400 |
|3   |60  |80  |
|4   |12  |100 |
|5   |20  |10  |
|7   |20  |40  |
|8   |30  |40  |
+----+----+----+

Option 3:

The last option is to use the built-in spark.sparkContext.wholeTextFiles.

This is the code to load all csv files into an RDD:

val data = spark.sparkContext.wholeTextFiles("file:///C:/data_csv/*.csv")
val df = spark.createDataFrame(data)

df.show(false)

And the output:

+--------------------------+--------------------------+
|_1                        |_2                        |
+--------------------------+--------------------------+
|file:/C:/data_csv/csv1.csv|1,100,200                 |
|                          |2,300,400                 |
|file:/C:/data_csv/csv2.csv|3,60,80                   |
|                          |4,12,100                  |
|                          |5,20,10                   |
|file:/C:/data_csv/csv3.csv|7,20,40                   |
|                          |8,30,40                   |
+--------------------------+--------------------------+

spark.sparkContext.wholeTextFiles will return a key/value RDD in which key is the file path and value is the file data.

This requires extra code to parse _2, which is the content of each csv. In my opinion this adds overhead in both performance and maintainability of the program, so I would avoid it.
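For completeness, a sketch of that extra parsing step: split each file's content into lines, split each line on commas, and rebuild typed columns (column names and types assumed from the schema defined above):

import spark.implicits._

// Explode each file's content into lines, then split each line into columns.
val parsed = data
  .flatMap { case (path, content) =>
    content.split("\n").filter(_.trim.nonEmpty).map { line =>
      val cols = line.split(",")
      (cols(0).trim.toInt, cols(1).trim.toInt, cols(2).trim.toInt)
    }
  }
  .toDF("col1", "col2", "col3")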

Let me know if you need further clarifications.

abiratsis Avatar answered Oct 22 '25 01:10



