How do I convert column of unix epoch to Date in Apache spark DataFrame using Java?

I have a JSON data file which contains a property [creationDate] that is a Unix epoch in "long" number type. The Apache Spark DataFrame schema looks like this:

root 
 |-- creationDate: long (nullable = true) 
 |-- id: long (nullable = true) 
 |-- postTypeId: long (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- title: string (nullable = true)
 |-- viewCount: long (nullable = true)

I would like to do a groupBy on "creationDate_Year", which needs to be derived from "creationDate".

What's the easiest way to do this kind of conversion in a DataFrame using Java?

asked Jan 06 '16 by ErhWen Kuo

People also ask

How do I change my epoch time to date on Spark?

The from_unixtime() SQL function converts (casts) an epoch time to a timestamp string; it takes the epoch time as its first argument and a time format string as its second. To get the current time, unix_timestamp(), which returns the current timestamp as epoch seconds (a Long), can be passed as the first argument.
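For example, a minimal Java sketch using the Spark 1.x-era API seen elsewhere on this page; the DataFrame df and its epoch-seconds column "epochSec" are hypothetical names used only for illustration:

import static org.apache.spark.sql.functions.*;

// from_unixtime() expects epoch seconds and returns a formatted timestamp string.
DataFrame withTs = df.withColumn("createdAt", from_unixtime(col("epochSec"), "yyyy-MM-dd HH:mm:ss"));

// unix_timestamp() with no arguments yields the current time as epoch seconds.
DataFrame withNow = df.withColumn("now", from_unixtime(unix_timestamp()));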

How do I change my epoch value to date?

To convert from an epoch to a human-readable date in plain Java:

String date = new java.text.SimpleDateFormat("MM/dd/yyyy HH:mm:ss").format(new java.util.Date(epoch * 1000));

Here the epoch is in seconds, so it is multiplied by 1000 because java.util.Date expects milliseconds; remove the * 1000 if the value is already in milliseconds. (The Delphi equivalent is myString := DateTimeToStr(UnixToDateTime(Epoch));, where Epoch is a signed integer.)

How do I change the datatype of a column in spark data frame?

To change a Spark SQL DataFrame column from one data type to another, use the cast() function of the Column class; it works inside withColumn(), select(), selectExpr(), and SQL expressions.
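For instance, a small sketch in Java; the DataFrame df and its string column "viewCount" are assumed purely for illustration:

import static org.apache.spark.sql.functions.col;

// cast() changes the column type; here a string column is converted to long.
DataFrame casted = df.withColumn("viewCount", col("viewCount").cast("long"));
// The same expression works inside select(): df.select(col("viewCount").cast("long"));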


2 Answers

After checking the Spark DataFrame API and SQL functions, I came up with the snippet below:

import static org.apache.spark.sql.functions.from_unixtime;

DataFrame df = sqlContext.read().json("MY_JSON_DATA_FILE");

DataFrame df_DateConverted = df.withColumn("creationDt", from_unixtime(df.col("creationDate").divide(1000)));

The "creationDate" column is divided by 1000 because the time units differ: the original "creationDate" is a Unix epoch in milliseconds, whereas Spark SQL's from_unixtime expects a Unix epoch in seconds.
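As a follow-up (not part of the original answer), here is a sketch of how the year could then be extracted for the groupBy mentioned in the question, assuming the same df_DateConverted from above:

import static org.apache.spark.sql.functions.*;

// year() reads the year out of the converted timestamp,
// which can then drive the groupBy on "creationDate_Year".
DataFrame byYear = df_DateConverted
    .withColumn("creationDate_Year", year(col("creationDt")))
    .groupBy("creationDate_Year")
    .count();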

answered by ErhWen Kuo


In PySpark, the same conversion from Unix epoch milliseconds to a DataFrame timestamp looks like this:

from pyspark.sql.functions import from_unixtime

df.select(from_unixtime((df.my_date_column.cast('bigint') / 1000)).cast('timestamp').alias('my_date_column'))
answered by Ray Metz