I am reading a DataFrame from a CSV file whose first column is an event date and time, e.g.
2016-08-08 07:45:28+03
In the code below, is it possible to specify within the schema definition how to convert such strings into dates?
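import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

// spark is an existing SparkSession
val df: DataFrame = spark.read.options(Map(
  "header" -> "true"
)).schema(StructType(
  StructField("EventTime", DataTypes.DateType, false) ::
  Nil
)).csv("C:/qos1h.csv")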
This code fails with
java.lang.NumberFormatException: For input string: "28+03"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at java.sql.Timestamp.valueOf(Timestamp.java:259)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:135)
at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:291)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:115)
at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:84)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$1.apply(CSVFileFormat.scala:125)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$1.apply(CSVFileFormat.scala:124)
It looks like it is impossible to specify the string-to-date conversion in the schema definition itself. However, the documentation of the DataFrameReader.csv method describes a dateFormat option that sets the format string applied to every DateType field.
Here is the fixed code:
val df: DataFrame = spark.read.options(Map(
  "header" -> "true",
  // X matches an ISO 8601 zone offset such as +03
  "dateFormat" -> "yyyy-MM-dd HH:mm:ssX"
)).schema(StructType(
  StructField("EventTime", DataTypes.DateType, false) ::
  Nil
)).csv("C:/qos1h.csv")
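To check a pattern string against a sample value before running the whole job, it can be tried on the plain JVM first. The sketch below assumes the Spark version in use interprets dateFormat with java.text.SimpleDateFormat-style pattern letters (newer Spark versions document their own pattern rules, so verify against your version). The DateFormatCheck object and its parse helper are hypothetical names introduced only for this illustration.

```scala
import java.text.SimpleDateFormat
import java.util.Date

object DateFormatCheck {
  // Same pattern string as the dateFormat option above.
  val Pattern = "yyyy-MM-dd HH:mm:ssX"

  // Parse one raw CSV value with the pattern.
  // Throws java.text.ParseException if the value does not match.
  def parse(raw: String): Date = {
    val fmt = new SimpleDateFormat(Pattern)
    fmt.setLenient(false) // reject out-of-range fields instead of rolling them over
    fmt.parse(raw)
  }

  def main(args: Array[String]): Unit = {
    val parsed = parse("2016-08-08 07:45:28+03")
    // 07:45:28 at UTC+03 is the same instant as 04:45:28 UTC.
    println(parsed.toInstant)
  }
}
```

If the sample value parses here, the same pattern is a reasonable candidate for the dateFormat option. Keep in mind that a DateType column stores only the calendar date, so the time of day is not retained in the resulting DataFrame.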