
Spark, Scala - determine column type

I can load data from a database and do some processing with it. The problem is that some tables store the date column as 'String', while others treat it as 'timestamp'.

I cannot know what type the date column is until the data is loaded.

x.getAs[String]("date")    // could fail when the date column is of timestamp type
x.getAs[Timestamp]("date") // could fail when the date column is of string type

This is how I load the data with Spark:

spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", table)
  .option("user", user)
  .option("password", password)
  .load()

Is there any way to treat them uniformly, or to always convert the column to a string?

J.Done asked Dec 27 '16


People also ask

How do I find the DataType of a column in Spark Scala?

In Spark you can get all DataFrame column names and types (DataType) by using df.dtypes and df.schema, where df is a DataFrame.
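For example, a minimal sketch (assuming an existing SparkSession named spark; the sample data is illustrative):

import spark.implicits._

val df = Seq(("a", "2016-02-01"), ("b", "2016-02-02")).toDF("key", "date")

// dtypes returns (columnName, typeName) pairs as strings
df.dtypes.foreach { case (name, tpe) => println(s"$name: $tpe") }

// schema returns the full StructType with DataType objects
println(df.schema)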

How do I change the DataType of a column in a DataFrame in Scala?

To change a Spark SQL DataFrame column from one data type to another, use the cast() function of the Column class; it works with withColumn(), select(), selectExpr(), and SQL expressions.
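A short sketch of both styles, assuming a DataFrame df with a string column named date (the names are illustrative):

import org.apache.spark.sql.functions.col

// cast via withColumn
val withTs = df.withColumn("date", col("date").cast("timestamp"))
// the same cast via selectExpr / SQL syntax
val withTs2 = df.selectExpr("CAST(date AS timestamp) AS date")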

How do I check if a column is numeric in Spark?

Unfortunately, Spark doesn't have an isNumeric() function, so you need to use existing functions to check whether a string column holds all or any numeric values. You may be tempted to write a Spark UDF for scenarios like this, but UDFs are not recommended because they do not perform well.
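One common approach, sketched below; the DataFrame df and the column name value are illustrative:

import org.apache.spark.sql.functions.col

// non-numeric strings become null when cast, so isNotNull keeps only numeric rows
val numericOnly = df.filter(col("value").cast("double").isNotNull)
// alternatively, a regular expression that matches digit-only strings
val digitsOnly = df.filter(col("value").rlike("^[0-9]+$"))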

How to get column names and types in a Spark DataFrame?

In Spark you can get all DataFrame column names and types (DataType) by using df.dtypes and df.schema, where df is a DataFrame. You can also look up the data type of a single column by name. Related: Convert Column Data Type in Spark DataFrame.
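A sketch of looking up a single column's type by name (assuming a DataFrame df with a column named date):

import org.apache.spark.sql.types.{DataType, TimestampType}

val dateType: DataType = df.schema("date").dataType
val dateIsTimestamp = dateType == TimestampType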

How to check whether a column's data type is integer or string in Spark?

To check the type of a specific DataFrame column, use df.schema, which returns all column names and types; from there you can read off a single column's type, or select all column names of a given type (for example, all string columns or all integer columns).
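A sketch of both checks (df and the column name value are illustrative):

import org.apache.spark.sql.types.{IntegerType, StringType}

// check whether one specific column is an integer
val valueIsInt = df.schema("value").dataType == IntegerType
// collect the names of all string-typed / integer-typed columns
val stringCols = df.schema.fields.filter(_.dataType == StringType).map(_.name)
val intCols    = df.schema.fields.filter(_.dataType == IntegerType).map(_.name)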

How to replace a column with a specific value in Spark?

Sometimes you may want to replace the values of all string-typed columns with a specific value, for example replace an empty string with a null value. To do so, you can use df.schema.fields to get all DataFrame columns and apply a filter to keep only the string columns. Refer to Spark Convert DataFrame Column Data Type.
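A sketch of replacing empty strings with nulls in every string-typed column (df is any DataFrame; this is one way to do it, not the only one):

import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.types.StringType

val stringCols = df.schema.fields.filter(_.dataType == StringType).map(_.name)
// fold over the string columns, rewriting "" to null in each
val cleaned = stringCols.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col(c) === "", lit(null).cast(StringType)).otherwise(col(c)))
}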

What is a numeric data type in Spark?

Spark SQL defines a set of numeric data types, such as ShortType, the data type representing Short values (use the singleton DataTypes.ShortType), alongside StringType, the data type representing String values.


2 Answers

You can pattern-match on the type of the column (using the DataFrame's schema) to decide whether to parse the String into a Timestamp or just use the Timestamp as is - and use the unix_timestamp function to do the actual conversion:

import java.sql.Timestamp
import java.text.SimpleDateFormat

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

import spark.implicits._ // needed for toDF and the $"date" syntax

// preparing some example data - df1 with String type and df2 with Timestamp type
val df1 = Seq(("a", "2016-02-01"), ("b", "2016-02-02")).toDF("key", "date")
val df2 = Seq(
  ("a", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-01").getTime)),
  ("b", new Timestamp(new SimpleDateFormat("yyyy-MM-dd").parse("2016-02-02").getTime))
).toDF("key", "date")

// If column is String, converts it to Timestamp
def normalizeDate(df: DataFrame): DataFrame = {
  df.schema("date").dataType match {
    case StringType => df.withColumn("date", unix_timestamp($"date", "yyyy-MM-dd").cast("timestamp"))
    case _ => df
  }
}

// after "normalizing", you can assume date has Timestamp type - 
// both would print the same thing:
normalizeDate(df1).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)
normalizeDate(df2).rdd.map(r => r.getAs[Timestamp]("date")).foreach(println)

Tzach Zohar answered Oct 03 '22


Here are a few things you can try:

(1) Start using the inferSchema option during load if you have a version that supports it. This makes Spark figure out the data types of the columns, though it doesn't work in all scenarios. Also look at the input data: if values are quoted, I advise adding an extra option to account for the quotes during the load.

val inputDF = spark.read.format("csv").option("header","true").option("inferSchema","true").load(fileLocation)
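If the values are wrapped in quotes, the CSV reader's quote option can be added; a sketch (the quote character to use depends on your data):

val inputDF = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("quote", "\"")
  .load(fileLocation)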

(2) To identify the data types of the columns you can use the code below; it places all of the column names and data types into their own Arrays of Strings.

val columnNames: Array[String] = inputDF.columns
val columnDataTypes: Array[String] = inputDF.schema.fields.map(_.dataType.toString)
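
A short usage sketch pairing the two arrays (assuming the inputDF from above):

columnNames.zip(columnDataTypes).foreach { case (name, dataType) =>
  println(s"$name: $dataType")
}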

afeldman answered Oct 03 '22