Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark DataFrame TimestampType - how to get Year, Month, Day values from field?

I have Spark DataFrame with take(5) top rows as follows:

[Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),  Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),  Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),  Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),  Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)] 

It's schema is defined as:

elevDF.printSchema()  root  |-- date: timestamp (nullable = true)  |-- hour: long (nullable = true)  |-- value: double (nullable = true) 

How do I get the Year, Month, Day values from the 'date' field?

like image 799
curtisp Avatar asked Jun 20 '15 00:06

curtisp


People also ask

How do I get the month from date in spark?

In order to get month, year and quarter from pyspark we will be using month(), year() and quarter() function respectively. year() Function with column name as argument extracts year from date in pyspark. month() Function with column name as argument extracts month from date in pyspark.

How does spark determine date format in DataFrame?

Use option("dateFormat", "MM/dd/yyyy") to validate date field in dataframe.It will discard the invalid rows.

How do you extract days in Pyspark?

Extract Day of the year from date in pyspark – Method 2date_format() Function with column name and “D” (upper case D) as argument extracts day of the year from date in pyspark and stored in the column name “D_O_Y” as shown below.


2 Answers

Since Spark 1.5 you can use a number of date processing functions:

  • pyspark.sql.functions.year
  • pyspark.sql.functions.month
  • pyspark.sql.functions.dayofmonth
  • pyspark.sql.functions.dayofweek()
  • pyspark.sql.functions.dayofyear
  • pyspark.sql.functions.weekofyear()

import datetime from pyspark.sql.functions import year, month, dayofmonth  elevDF = sc.parallelize([     (datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),     (datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),     (datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),     (datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),     (datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55) ]).toDF(["date", "hour", "value"])  elevDF.select(     year("date").alias('year'),      month("date").alias('month'),      dayofmonth("date").alias('day') ).show() # +----+-----+---+ # |year|month|day| # +----+-----+---+ # |1984|    1|  1| # |1984|    1|  1| # |1984|    1|  1| # |1984|    1|  1| # |1984|    1|  1| # +----+-----+---+ 

You can use simple map as with any other RDD:

elevDF = sqlContext.createDataFrame(sc.parallelize([         Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),         Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),         Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),         Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),         Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]))  (elevDF  .map(lambda (date, hour, value): (date.year, date.month, date.day))  .collect()) 

and the result is:

[(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)] 

Btw: datetime.datetime stores an hour anyway so keeping it separately seems to be a waste of memory.

like image 57
zero323 Avatar answered Sep 22 '22 22:09

zero323


You can use functions in pyspark.sql.functions: functions like year, month, etc

refer to here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

from pyspark.sql.functions import *  newdf = elevDF.select(year(elevDF.date).alias('dt_year'), month(elevDF.date).alias('dt_month'), dayofmonth(elevDF.date).alias('dt_day'), dayofyear(elevDF.date).alias('dt_dayofy'), hour(elevDF.date).alias('dt_hour'), minute(elevDF.date).alias('dt_min'), weekofyear(elevDF.date).alias('dt_week_no'), unix_timestamp(elevDF.date).alias('dt_int'))  newdf.show()   +-------+--------+------+---------+-------+------+----------+----------+ |dt_year|dt_month|dt_day|dt_dayofy|dt_hour|dt_min|dt_week_no|    dt_int| +-------+--------+------+---------+-------+------+----------+----------+ |   2015|       9|     6|      249|      0|     0|        36|1441497601| |   2015|       9|     6|      249|      0|     0|        36|1441497601| |   2015|       9|     6|      249|      0|     0|        36|1441497603| |   2015|       9|     6|      249|      0|     1|        36|1441497694| |   2015|       9|     6|      249|      0|    20|        36|1441498808| |   2015|       9|     6|      249|      0|    20|        36|1441498811| |   2015|       9|     6|      249|      0|    20|        36|1441498815| 
like image 43
hamed Avatar answered Sep 19 '22 22:09

hamed