Explicit cast reading .csv with case class Spark 2.1.0

I have the following case class:

case class OrderDetails(OrderID : String, ProductID : String, UnitPrice : Double,
                        Qty : Int, Discount : Double)

I am trying to read this CSV: https://github.com/xsankar/fdps-v3/blob/master/data/NW-Order-Details.csv

This is my code:

val spark = SparkSession.builder.master(sparkMaster).appName(sparkAppName).getOrCreate()
import spark.implicits._
val orderDetails = spark.read.option("header","true").csv( inputFiles + "NW-Order-Details.csv").as[OrderDetails]

And the error is:

 Exception in thread "main" org.apache.spark.sql.AnalysisException: 
 Cannot up cast `UnitPrice` from string to double as it may truncate
 The type path of the target object is:
  - field (class: "scala.Double", name: "UnitPrice")
  - root class: "es.own3dh2so4.OrderDetails"
 You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;

Why can't it be transformed, if all the values in the field are doubles? What am I not understanding?

Spark version 2.1.0, Scala version 2.11.7

Asked Apr 02 '17 by own3dh2so4


People also ask

What does inferSchema = true do?

inferSchema tells the reader to guess the data type of each field automatically. If this option is set to true, the API reads some sample records from the file to infer the schema. If it is left as false, a schema must be specified explicitly.
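Applied to the question, that means the type mismatch can also be avoided by letting Spark infer the types instead of casting afterwards. A minimal sketch, reusing the spark session, the inputFiles path, and the OrderDetails case class from the question (the variable name inferred is just illustrative):

val inferred = spark.read
   .option("header", "true")
   .option("inferSchema", "true")   // sample the file so UnitPrice, Qty and Discount get numeric types
   .csv(inputFiles + "NW-Order-Details.csv")

inferred.printSchema()              // check the inferred types before converting
val orderDetails = inferred.as[OrderDetails]

The .as[OrderDetails] step only succeeds if the inferred column types line up with the case class fields; if they do not, an explicit cast (as in the accepted answer below) is still needed.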

What option can be used to automatically infer the datatype of a column?

Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes.
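As a hedged sketch of that reflection-based inference, assuming a SparkSession named spark and the OrderDetails case class from the question are in scope (the sample row values are made up):

import spark.implicits._

val sampleRdd = spark.sparkContext.parallelize(Seq(
   OrderDetails("10248", "11", 14.0, 12, 0.0)   // hypothetical sample row
))

val sampleDf = sampleRdd.toDF()   // column names and types are derived from the case class fields
sampleDf.printSchema()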

Which option can be used in Spark SQL if you need an in-memory columnar structure to cache tables?

Spark SQL can cache tables using an in-memory columnar format by calling spark.catalog.cacheTable("tableName") or dataFrame.cache().
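A short sketch of those two calls, assuming the spark session and the orderDetails Dataset from the question are in scope (the view name order_details is illustrative):

orderDetails.createOrReplaceTempView("order_details")
spark.catalog.cacheTable("order_details")   // cache the registered view in the columnar format

orderDetails.cache()                        // or cache the Dataset directly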


1 Answer

You just need to explicitly cast your non-String fields to their target types, since every column read from a CSV without inferSchema comes back as a String:

import org.apache.spark.sql.types.{DoubleType, IntegerType}

val orderDetails = spark.read
   .option("header","true")
   .csv(inputFiles + "NW-Order-Details.csv")
   .withColumn("UnitPrice", 'UnitPrice.cast(DoubleType))
   .withColumn("Qty", 'Qty.cast(IntegerType))
   .withColumn("Discount", 'Discount.cast(DoubleType))
   .as[OrderDetails]

On a side note, by Scala (and Java) convention, your case class constructor parameters should be lower camel case:

case class OrderDetails(orderID: String, 
                        productID: String, 
                        unitPrice: Double,
                        qty: Int, 
                        discount: Double)
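An alternative sketch (not from the original answer) is to hand the reader an explicit schema up front, so the columns never arrive as Strings at all; the column names below are assumed to match the header row of NW-Order-Details.csv:

import org.apache.spark.sql.types._

val orderSchema = StructType(Seq(
   StructField("OrderID", StringType),
   StructField("ProductID", StringType),
   StructField("UnitPrice", DoubleType),
   StructField("Qty", IntegerType),
   StructField("Discount", DoubleType)
))

val orderDetails = spark.read
   .option("header", "true")
   .schema(orderSchema)                        // enforce the types directly, no casts needed
   .csv(inputFiles + "NW-Order-Details.csv")
   .as[OrderDetails]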
Answered Oct 04 '22 by Vidya