
How to import multiple csv files in a single load?

Suppose I have a defined schema for loading 10 CSV files in a folder. Is there a way to load the tables automatically using Spark SQL? I know this can be done by creating an individual DataFrame for each file [as below], but can it be automated with a single command, pointing at a folder rather than an individual file?

df = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load("../Downloads/2008.csv")
Chendur asked Jun 05 '16


2 Answers

Use a wildcard, e.g. replace 2008 with *:

df = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load("../Downloads/*.csv")  # <-- note the star (*)

Spark 2.0

# these lines are equivalent in Spark 2.0
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
spark.read.option("header", "true").csv("../Downloads/*.csv")

Notes:

  1. Replace format("com.databricks.spark.csv") with format("csv"), or use the csv method directly; the com.databricks.spark.csv format was integrated into Spark 2.0.

  2. Use spark (the SparkSession) instead of sqlContext.
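The wildcard is expanded by Spark's file listing, not by your shell, so it is worth sanity-checking which files a pattern actually matches before loading. As a hedged illustration (the folder and file names here are made up, created in a scratch directory), Python's standard glob module applies the same kind of pattern:

```python
import glob
import os
import tempfile

# Create a scratch folder with a few CSV files plus one non-CSV file.
tmp = tempfile.mkdtemp()
for name in ["2007.csv", "2008.csv", "notes.txt"]:
    with open(os.path.join(tmp, name), "w") as f:
        f.write("a,b\n1,2\n")

# The same style of pattern you would hand to spark.read: only *.csv matches.
matched = sorted(glob.glob(os.path.join(tmp, "*.csv")))
print([os.path.basename(p) for p in matched])  # ['2007.csv', '2008.csv']
```

The non-CSV file is silently skipped, which is exactly the behavior you rely on when pointing Spark at a mixed folder with a *.csv pattern.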

Yaron answered Oct 11 '22


Ex1:

Reading a single CSV file by its complete file path:

val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\tmp\\cars1.csv")

Ex2:

Reading multiple CSV files by passing their names:

val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\tmp\\cars1.csv", "C:\\spark\\sample_data\\tmp\\cars2.csv")

Ex3:

Reading multiple CSV files by passing a list of paths:

val paths = List("C:\\spark\\sample_data\\tmp\\cars1.csv", "C:\\spark\\sample_data\\tmp\\cars2.csv")
val df = spark.read.option("header", "true").csv(paths: _*)

Ex4:

Reading all CSV files in a folder, ignoring other files:

val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\tmp\\*.csv")

Ex5:

Reading multiple CSV files from multiple folders:

val folders = List("C:\\spark\\sample_data\\tmp", "C:\\spark\\sample_data\\tmp1")
val df = spark.read.option("header", "true").csv(folders: _*)
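The folder list does not have to be hard-coded. As a sketch (in Python, with made-up scratch folders standing in for tmp and tmp1), you can collect CSV paths from several folders with pathlib and hand the resulting list to spark.read.csv, which accepts a list of paths in PySpark; the Spark call is left commented out so the snippet stands alone:

```python
from pathlib import Path
import tempfile

# Two scratch folders standing in for the tmp and tmp1 folders above.
base = Path(tempfile.mkdtemp())
layout = {"tmp": ["cars1.csv"], "tmp1": ["cars2.csv", "cars3.csv"]}
for folder, files in layout.items():
    d = base / folder
    d.mkdir()
    for name in files:
        (d / name).write_text("make,model\nford,focus\n")

# Gather every CSV under both folders into one flat, sorted list of path strings.
folders = [base / "tmp", base / "tmp1"]
paths = sorted(str(p) for d in folders for p in d.glob("*.csv"))
print(len(paths))  # 3

# In PySpark, the list can then be passed directly:
# df = spark.read.option("header", "true").csv(paths)
```

This mirrors the Scala `folders: _*` varargs expansion: either way, Spark ends up with an explicit list of inputs to read in a single load.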
mputha answered Oct 11 '22