
Spark SQL: automatic schema from csv

Does Spark SQL provide any way to automatically load CSV data? I found the following JIRA: https://issues.apache.org/jira/browse/SPARK-2360 but it was closed....

Currently I load a CSV file as follows:

case class Record(id: String, val1: String, val2: String, ....)

sc.textFile("Data.csv")
  .map(_.split(","))
  .map { r =>
    Record(r(0), r(1), .....)
  }.registerAsTable("table1")

Any hints on automatic schema deduction from CSV files? In particular, a) how can I generate a class representing the schema, and b) how can I automatically fill it (i.e. Record(r(0), r(1), .....))?

Update: I found a partial answer to the schema generation here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#data-sources

// The schema is encoded in a string
val schemaString = "name age"
// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)

So the only remaining question is how to do the step map(p => Row(p(0), p(1).trim)) dynamically for a given number of attributes.
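One way to make that step work for an arbitrary number of columns is to build each Row from the full split array with Row(values: _*) and to derive the StructType from the CSV header line. A minimal, untested sketch (it assumes the first line of Data.csv is a header and that all columns are kept as strings):

import org.apache.spark.sql._

val lines = sc.textFile("Data.csv")

// Assume the first line holds the column names
val header = lines.first()
val fieldNames = header.split(",").map(_.trim)

// One StringType column per header field
val schema = StructType(fieldNames.map(name => StructField(name, StringType, nullable = true)))

// Turn every remaining line into a Row with as many columns as the header,
// without hard-coding the arity (Row takes varargs, so the array is splatted)
val rowRDD = lines
  .filter(_ != header)
  .map(_.split(",", -1).map(_.trim))
  .map(values => Row(values: _*))

val csvSchemaRDD = sqlContext.applySchema(rowRDD, schema)
csvSchemaRDD.registerTempTable("table1")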

Thanks for your support! Joerg

asked Nov 17 '14 by js84


1 Answer

You can use spark-csv, which saves you a few keystrokes: you don't have to define the column names yourself, and it can pick them up automatically from the header row.
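For reference, a minimal sketch of what that looks like (assuming Spark 1.4+ for the sqlContext.read API and the Databricks spark-csv package on the classpath; Data.csv is the file name from the question, and on Spark 1.3 you would use sqlContext.load with the same format name instead):

// Start the shell with the package, e.g.:
//   spark-shell --packages com.databricks:spark-csv_2.10:1.5.0

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // take column names from the first line
  .option("inferSchema", "true") // guess column types instead of defaulting to strings
  .load("Data.csv")

df.registerTempTable("table1")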

answered Oct 15 '22 by dimitrisli