I am loading a CSV file into a DataFrame as below.
val conf=new SparkConf().setAppName("dataframes").setMaster("local")
val sc=new SparkContext(conf)
val spark=SparkSession.builder().getOrCreate()
import spark.implicits._
val df = spark.
read.
format("org.apache.spark.csv").
option("header", true).
csv("/home/cloudera/Book1.csv")
scala> df.printSchema()
root
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- age: string (nullable = true)
How can I change the age column to be of type Int?
Given val spark = SparkSession.builder().getOrCreate(), I guess you're using Spark 2.x.
First of all, please note that Spark 2.x has native support for the CSV format and as such does not require specifying the format by its long name, i.e. org.apache.spark.csv, but just csv.
spark.read.format("csv")...
Since you use the csv operator, the CSV format is implied, so you can skip/remove format("csv").
// note that I removed format("csv")
spark.read.option("header", true).csv("/home/cloudera/Book1.csv")
With that you have plenty of options, but I strongly recommend using a case class for...just the schema. See the last solution if you're curious how to do it in Spark 2.0.
You could use the cast operator.
scala> Seq("1").toDF("str").withColumn("num", 'str cast "int").printSchema
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
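Applied to the question's DataFrame, the cast could look like the sketch below. The object name CastAgeExample, the helper castAge and the sample rows are made up for illustration; only withColumn and cast come from the Spark API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CastAgeExample {
  // Replace the string column "age" with an integer column of the same name.
  def castAge(df: DataFrame): DataFrame =
    df.withColumn("age", col("age").cast("int"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cast-age").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the DataFrame read from Book1.csv: all columns are strings.
    val df = Seq(("Alice", "1 Main St", "34"), ("Bob", "2 High St", "41"))
      .toDF("name", "address", "age")

    castAge(df).printSchema()
    spark.stop()
  }
}
```

Reusing the column name in withColumn overwrites the original string column, so the resulting DataFrame keeps the same three columns.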
You can also use your own hand-crafted schema with StructType and StructField as follows:
import org.apache.spark.sql.types._
val schema = StructType(
StructField("str", StringType, true) ::
StructField("num", IntegerType, true) :: Nil)
scala> schema.printTreeString
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
val q = spark.
read.
option("header", true).
schema(schema).
csv("numbers.csv")
scala> q.printSchema
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
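For the file in the question, a hand-crafted schema might look like the following sketch. The column names come from the printSchema output in the question; the object name ExplicitSchemaExample, the helper readBook and the temporary sample file are assumptions added so the example can run anywhere.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

object ExplicitSchemaExample {
  // Schema mirroring the columns in the question, with age declared as an integer.
  val bookSchema: StructType = StructType(
    StructField("name", StringType, nullable = true) ::
    StructField("address", StringType, nullable = true) ::
    StructField("age", IntegerType, nullable = true) :: Nil)

  // Read a CSV file with a header row using the schema above.
  def readBook(spark: SparkSession, path: String): DataFrame =
    spark.read.option("header", true).schema(bookSchema).csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("explicit-schema").master("local[*]").getOrCreate()

    // Hypothetical sample data written to a temporary file.
    val tmp = Files.createTempFile("book", ".csv")
    Files.write(tmp, "name,address,age\nAlice,1 Main St,34\n".getBytes("UTF-8"))

    readBook(spark, tmp.toString).printSchema()
    spark.stop()
  }
}
```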
What I found quite interesting lately was the so-called Schema DSL. The above schema built using StructType and StructField can be re-written as follows:
import org.apache.spark.sql.types._
val schema = StructType(
$"str".string ::
$"num".int :: Nil)
scala> schema.printTreeString
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
// or even
val schema = new StructType().
add($"str".string).
add($"num".int)
scala> schema.printTreeString
root
|-- str: string (nullable = true)
|-- num: integer (nullable = true)
Encoders are so easy to use that it's hard to believe you could not want them, even only to build a schema without dealing with StructType, StructField and DataType.
// Define a business object that describes your dataset
case class MyRecord(str: String, num: Int)
// Use Encoders object to create a schema off the business object
import org.apache.spark.sql.Encoders
val schema = Encoders.product[MyRecord].schema
scala> schema.printTreeString
root
|-- str: string (nullable = true)
|-- num: integer (nullable = false)
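The same idea can be taken one step further for the question's file: use the encoder's schema when reading, then convert the result to a typed Dataset. This is only a sketch; the case class Person, the object EncoderSchemaExample and the helper readPeople are hypothetical names, and the fields are assumed to match the CSV header.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical business object matching the columns in the question.
case class Person(name: String, address: String, age: Int)

object EncoderSchemaExample {
  // Derive the schema from the case class and read the CSV as a typed Dataset.
  def readPeople(spark: SparkSession, path: String): Dataset[Person] = {
    import spark.implicits._
    spark.read
      .option("header", true)
      .schema(Encoders.product[Person].schema)
      .csv(path)
      .as[Person]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("encoder-schema").master("local[*]").getOrCreate()

    // Hypothetical sample data written to a temporary file.
    val tmp = Files.createTempFile("people", ".csv")
    Files.write(tmp, "name,address,age\nAlice,1 Main St,34\n".getBytes("UTF-8"))

    readPeople(spark, tmp.toString).printSchema()
    spark.stop()
  }
}
```

With this approach the case class is the single place where column names and types are declared.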
There is an inferSchema option that automatically recognizes the column types:
val df=spark.read
.format("org.apache.spark.csv")
.option("header", true)
.option("inferSchema", true) // <-- HERE
.csv("/home/cloudera/Book1.csv")
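To check what inferSchema produces, a small self-contained sketch along these lines may help. The object name InferSchemaExample, the helper readInferred and the temporary sample file are assumptions for illustration; the header and inferSchema options are the ones used above.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object InferSchemaExample {
  // Let Spark guess the column types from the data instead of defaulting to strings.
  def readInferred(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", true)
      .option("inferSchema", true)
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("infer-schema").master("local[*]").getOrCreate()

    // Hypothetical sample data written to a temporary file.
    val tmp = Files.createTempFile("book", ".csv")
    Files.write(tmp, "name,address,age\nAlice,1 Main St,34\nBob,2 High St,41\n".getBytes("UTF-8"))

    readInferred(spark, tmp.toString).printSchema()
    spark.stop()
  }
}
```

Keep in mind that inference needs an extra pass over the data, so an explicit schema is usually the cheaper choice for large files.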
spark-csv was originally an external library by Databricks, but it has been included in core Spark since version 2.0. You can refer to the documentation on the library's GitHub page to find the available options.