How to specify schema for CSV file without using Scala case class?

I am loading a CSV file into a DataFrame as below.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf=new SparkConf().setAppName("dataframes").setMaster("local")
val sc=new SparkContext(conf)
val spark=SparkSession.builder().getOrCreate()
import spark.implicits._

val df = spark.
  read.  
  format("org.apache.spark.csv").
  option("header", true).
  csv("/home/cloudera/Book1.csv")
scala> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- age: string (nullable = true)

How can I change the age column to be of type Int?

546

asked Nov 17 '16 11:11

2 Answers

Given val spark=SparkSession.builder().getOrCreate() I guess you're using Spark 2.x.

First of all, please note that Spark 2.x has native support for the CSV format, so you do not need to specify the format by its long name (org.apache.spark.csv in your code); the short name csv is enough.

spark.read.format("csv")...

Since you use the csv operator, the CSV format is implied, so you can drop format("csv") altogether.

// note that I removed format("csv")
spark.read.option("header", true).csv("/home/cloudera/Book1.csv")

With that, you have plenty of options, but I strongly recommend using a case class for...just the schema. See the last solution (Encoders) if you're curious how to do it in Spark 2.0.

cast operator

You could use the cast operator.

scala> Seq("1").toDF("str").withColumn("num", 'str cast "int").printSchema
root
 |-- str: string (nullable = true)
 |-- num: integer (nullable = true)
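Applied to the DataFrame from the question, the same cast could look like the sketch below. An in-memory Seq stands in for Book1.csv so the snippet runs on its own; the sample rows and the app name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cast-age").getOrCreate()
import spark.implicits._

// Stand-in for the CSV from the question: every column starts out as a string
val df = Seq(("Alice", "1 Main St", "30"), ("Bob", "2 High St", "41"))
  .toDF("name", "address", "age")

// Replace the string column with an integer column of the same name
val typed = df.withColumn("age", $"age".cast("int"))
typed.printSchema()
```

With ANSI mode off (the default in Spark 2.x and 3.x), a value that cannot be parsed as an integer becomes null instead of failing the job, so it may be worth checking for unexpected nulls after the cast.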

Using StructType

You can also use your own hand-crafted schema with StructType and StructField as follows:

import org.apache.spark.sql.types._    
val schema = StructType(
  StructField("str", StringType, true) :: 
  StructField("num", IntegerType, true) :: Nil)

scala> schema.printTreeString
root
 |-- str: string (nullable = true)
 |-- num: integer (nullable = true)

val q = spark.
  read.
  option("header", true).
  schema(schema).
  csv("numbers.csv")
scala> q.printSchema
root
 |-- str: string (nullable = true)
 |-- num: integer (nullable = true)
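For the file in the question, a hand-crafted schema might look like the following sketch. The column names come from the printSchema output in the question; the temporary file and its single data row are invented here so the example is self-contained, and you would pass your own path (for example /home/cloudera/Book1.csv) instead.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("csv-schema").getOrCreate()

// Hypothetical sample file standing in for Book1.csv
val path = Files.createTempFile("book", ".csv")
Files.write(path, "name,address,age\nAlice,1 Main St,30\n".getBytes("UTF-8"))

// Same three columns as in the question, with age declared as an integer
val bookSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("address", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

val books = spark.read
  .option("header", true)
  .schema(bookSchema)
  .csv(path.toString)
books.printSchema()
```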

Schema DSL

What I found quite interesting lately is the so-called Schema DSL. The $"..." syntax relies on import spark.implicits._ (already present in your code). The above schema built using StructType and StructField can be rewritten as follows:

import org.apache.spark.sql.types._
val schema = StructType(
  $"str".string ::
  $"num".int :: Nil) 
scala> schema.printTreeString
root
 |-- str: string (nullable = true)
 |-- num: integer (nullable = true)

// or even
val schema = new StructType().
  add($"str".string).
  add($"num".int)
scala> schema.printTreeString
root
 |-- str: string (nullable = true)
 |-- num: integer (nullable = true)
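If your Spark version supports it, the schema can also be given as a DDL-formatted string passed straight to the reader. As far as I recall, this schema(String) overload arrived in Spark 2.3, so treat the version as an assumption and check your release. The temporary file below is a made-up stand-in for the CSV from the question.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ddl-schema").getOrCreate()

// Hypothetical sample file standing in for Book1.csv
val path = Files.createTempFile("book", ".csv")
Files.write(path, "name,address,age\nAlice,1 Main St,30\n".getBytes("UTF-8"))

// Column names and types written as a single DDL string
val books = spark.read
  .option("header", true)
  .schema("name STRING, address STRING, age INT")
  .csv(path.toString)
books.printSchema()
```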

Encoders

Encoders are so easy to use that it's hard to believe you would not want them, if only to build a schema without dealing with StructType, StructField and DataType.

// Define a business object that describes your dataset
case class MyRecord(str: String, num: Int)

// Use Encoders object to create a schema off the business object
import org.apache.spark.sql.Encoders    
val schema = Encoders.product[MyRecord].schema
scala> schema.printTreeString
root
 |-- str: string (nullable = true)
 |-- num: integer (nullable = false)
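A schema derived this way can be handed to the reader like any other, and the same case class then gives you a typed Dataset. The sketch below assumes a CSV with a header line and the two columns str and num; the temporary file and its contents are invented for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Encoders, SparkSession}

// Business object describing one row of the dataset
case class MyRecord(str: String, num: Int)

val spark = SparkSession.builder().master("local[*]").appName("encoder-schema").getOrCreate()
import spark.implicits._

// Hypothetical sample file with a header and one row
val path = Files.createTempFile("numbers", ".csv")
Files.write(path, "str,num\none,1\n".getBytes("UTF-8"))

// Derive the schema from the case class, then read and convert to Dataset[MyRecord]
val schema = Encoders.product[MyRecord].schema
val records = spark.read
  .option("header", true)
  .schema(schema)
  .csv(path.toString)
  .as[MyRecord]
records.show()
```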

192

answered Sep 19 '22 15:09

Jacek Laskowski

There is an inferSchema option that automatically detects the type of each column:

val df=spark.read
  .format("org.apache.spark.csv")
  .option("header", true)
  .option("inferSchema", true) // <-- HERE
  .csv("/home/cloudera/Book1.csv")

spark-csv was originally an external library by Databricks, but it has been included in core Spark since version 2.0. You can refer to the documentation on the library's GitHub page for the available options.
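To see what inference produces, you can inspect the resulting schema, as in this sketch. The temporary file is a made-up stand-in for Book1.csv. Keep in mind that inferSchema needs an extra pass over the data, so an explicit schema is usually cheaper for large files.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("infer-schema").getOrCreate()

// Hypothetical sample file standing in for Book1.csv
val path = Files.createTempFile("book", ".csv")
Files.write(path, "name,address,age\nAlice,1 Main St,30\nBob,2 High St,41\n".getBytes("UTF-8"))

// Let Spark scan the data and pick a type for each column
val inferred = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv(path.toString)
inferred.printSchema()
```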

answered Sep 21 '22 15:09

vdep

Tags:

scala

apache-spark

apache-spark-sql

Ishan Kumar
