How to change the column type from String to Date in DataFrames?

I have a DataFrame with two columns (C, D) defined as string column type, but the data in the columns are actually dates. For example, column C has the date "01-APR-2015" and column D has "20150401". I want to change these to a date column type, but I haven't found a good way of doing that. I looked at Stack Overflow's I need to convert the string column type to Date column type in Spark SQL's DataFrame, where the date format can be "01-APR-2015", and I also looked at this post, but it didn't have any info related to dates.
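For reference, a minimal sketch reproducing the DataFrame described above (the column names C and D and the sample values come from the question; the session setup is assumed):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Two string columns that actually hold dates
val df = Seq(("01-APR-2015", "20150401")).toDF("C", "D")
df.printSchema() // C: string, D: string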

asked Apr 29 '16 by dbspace


1 Answer

Spark >= 2.2

You can use to_date:

import org.apache.spark.sql.functions.{to_date, to_timestamp}

df.select(to_date($"ts", "dd-MMM-yyyy").alias("date"))

or to_timestamp:

df.select(to_timestamp($"ts", "dd-MMM-yyyy").alias("timestamp"))

Both parse the pattern directly, with no intermediate unix_timestamp call.
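Applied to the two columns from the question (a sketch; df, C, and D are the question's names, and the format patterns are inferred from the sample values):

import org.apache.spark.sql.functions.to_date

val parsed = df.select(
  to_date($"C", "dd-MMM-yyyy").alias("C_date"), // "01-APR-2015"
  to_date($"D", "yyyyMMdd").alias("D_date")     // "20150401"
)

Note that Spark 3.x switched to a stricter, case-sensitive parser, so the all-caps "APR" may require setting spark.sql.legacy.timeParserPolicy to LEGACY.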

Spark < 2.2

Since Spark 1.5 you can use the unix_timestamp function to parse the string to a long (seconds since the epoch), cast it to timestamp, and truncate with to_date:

import org.apache.spark.sql.functions.{unix_timestamp, to_date}
import spark.implicits._ // for the $ column syntax and toDF

val df = Seq((1L, "01-APR-2015")).toDF("id", "ts")

df.select(to_date(unix_timestamp(
  $"ts", "dd-MMM-yyyy"
).cast("timestamp")).alias("date"))
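The result is a proper date column; a sketch of the expected output, plus the same pattern for the question's other format (the D column is an assumption here, since the example df above only has id and ts):

// +----------+
// |      date|
// +----------+
// |2015-04-01|
// +----------+

// For the "20150401" format described in the question:
// df.select(to_date(unix_timestamp($"D", "yyyyMMdd").cast("timestamp")).alias("D_date"))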

Note:

Depending on your Spark version, this may require some adjustments due to SPARK-11724:

Casting from integer types to timestamp treats the source int as being in millis. Casting from timestamp to integer types creates the result in seconds.

If you use an unpatched version, the unix_timestamp output requires multiplication by 1000.
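A sketch of that adjustment, only for unpatched builds affected by SPARK-11724 (on fixed versions the code above is correct as written):

// unix_timestamp returns seconds since the epoch; on affected builds the
// long-to-timestamp cast treats the value as milliseconds, hence * 1000
df.select(to_date((unix_timestamp($"ts", "dd-MMM-yyyy") * 1000)
  .cast("timestamp")).alias("date"))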

answered Sep 20 '22 by zero323