
Convert string data in a DataFrame into double

I have a CSV file containing double-type data. When I load it into a DataFrame, I get an error telling me that java.lang.String cannot be cast to java.lang.Double, although my data are numeric. How do I get a DataFrame from this CSV file with columns of double type? How should I modify my code?

import org.apache.spark.sql.SparkSession

object Example extends App {

  val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()

  // Without an explicit schema, every column is read as StringType
  val data = spark.read.csv("C://lpsa.data")
    .toDF("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9")
  val data2 = data.select("col2", "col3", "col4", "col5", "col6", "col7")
}

What should I do to transform each of these columns into double type? Thanks.

asked by Hattabi Maher

2 Answers

Use select with cast:

import org.apache.spark.sql.functions.col

// Cast each listed column to double; values that cannot be parsed become null
val data2 = data.select(Seq("col2", "col3", "col4", "col5", "col6", "col7").map(
  c => col(c).cast("double")
): _*)
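
If you also want to keep the columns that are not cast, a common alternative is to fold over the column names with withColumn (a minimal sketch, assuming the same data DataFrame as above):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Cast the listed columns in place; all other columns pass through unchanged
val colsToCast = Seq("col2", "col3", "col4", "col5", "col6", "col7")
val casted: DataFrame = colsToCast.foldLeft(data)(
  (df, c) => df.withColumn(c, col(c).cast("double"))
)

casted.printSchema()  // the six listed columns should now report double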

Alternatively, pass a schema to the reader:

  • define the schema:

    import org.apache.spark.sql.types._
    
    val cols = Seq(
      "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9"
    )
    
    val doubleCols = Set("col2", "col3", "col4", "col5", "col6", "col7")
    
    val schema = StructType(cols.map(
      c => StructField(c, if (doubleCols contains c) DoubleType else StringType)
    ))
    
  • and pass it as an argument to the schema method:

    spark.read.schema(schema).csv(path)
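
As a side note, on Spark 2.3 or later the schema method also accepts a DDL-formatted string, which saves building the StructType by hand (a sketch under that version assumption):

    // DDL-string equivalent of the StructType above (requires Spark 2.3+)
    val ddl = "col1 STRING, col2 DOUBLE, col3 DOUBLE, col4 DOUBLE, " +
      "col5 DOUBLE, col6 DOUBLE, col7 DOUBLE, col8 STRING, col9 STRING"

    spark.read.schema(ddl).csv(path)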
    

It is also possible to use schema inference:

spark.read.option("inferSchema", "true").csv(path)

but it is more expensive, because inferring the schema requires an additional pass over the data.

answered by zero323


I believe Spark's inferSchema option comes in handy when reading the CSV file. Below is the code to automatically detect your columns as double type:

// inferSchema makes Spark scan the data and choose numeric types automatically
val data = spark.read
                .format("csv")
                .option("header", "false")
                .option("inferSchema", "true")
                .load("C://lpsa.data").toDF()


Note: I am using Spark version 2.2.0.
answered by Saurabh Singh