I have a Spark DataFrame whose top rows from take(5) are as follows:
[Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]
Its schema is defined as:
elevDF.printSchema()

root
 |-- date: timestamp (nullable = true)
 |-- hour: long (nullable = true)
 |-- value: double (nullable = true)
How do I get the Year, Month, Day values from the 'date' field?
To get the month, year, and quarter of a date in PySpark, use the month(), year(), and quarter() functions respectively. year() takes a column (or column name) as its argument and extracts the year from the date; month() does the same for the month.
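For example, a minimal sketch using the elevDF from the question (year(), month(), and quarter() are all available since Spark 1.5):

from pyspark.sql.functions import year, month, quarter

# extract calendar parts from the timestamp column "date"
elevDF.select(
    year("date").alias("year"),
    month("date").alias("month"),
    quarter("date").alias("quarter")   # 1 through 4
).show(5)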
If you are loading the data from a text source such as CSV, option("dateFormat", "MM/dd/yyyy") tells the reader how to parse the date field; combined with mode "DROPMALFORMED" it will discard the rows whose dates do not match.
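A hedged sketch of that (the file path, header option, and schema are assumptions, not from the question):

# assumption: data lives in a CSV file with a header row (path is hypothetical)
df = (spark.read
      .option("header", "true")
      .option("dateFormat", "MM/dd/yyyy")   # expected format of the date field
      .option("mode", "DROPMALFORMED")      # drop rows that fail to parse
      .schema("date DATE, hour BIGINT, value DOUBLE")
      .csv("/path/to/elevation.csv"))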
You can also extract the day of the year with date_format(): pass the column name and the pattern "D" (upper case D), and store the result in a column such as "D_O_Y".
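A minimal sketch, again assuming the elevDF from the question (note that date_format() returns a string, while dayofyear() returns an integer):

from pyspark.sql.functions import date_format

# "D" is the day-of-year pattern letter
elevDF.withColumn("D_O_Y", date_format("date", "D")).show(5)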
Since Spark 1.5 you can use a number of date processing functions:
pyspark.sql.functions.year
pyspark.sql.functions.month
pyspark.sql.functions.dayofmonth
pyspark.sql.functions.dayofweek (added later, in Spark 2.3)
pyspark.sql.functions.dayofyear
pyspark.sql.functions.weekofyear
import datetime
from pyspark.sql.functions import year, month, dayofmonth

elevDF = sc.parallelize([
    (datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55)
]).toDF(["date", "hour", "value"])

elevDF.select(
    year("date").alias('year'),
    month("date").alias('month'),
    dayofmonth("date").alias('day')
).show()

# +----+-----+---+
# |year|month|day|
# +----+-----+---+
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# +----+-----+---+
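On Spark 2.x or later you can build the same example through a SparkSession instead of sc.parallelize — a minimal sketch, assuming `spark` is an active SparkSession:

import datetime
from pyspark.sql.functions import year, month, dayofmonth

# same five rows as above, built via the SparkSession API
elevDF = spark.createDataFrame(
    [(datetime.datetime(1984, 1, 1, 0, 0), h, 638.55) for h in range(1, 6)],
    ["date", "hour", "value"])

elevDF.select(
    year("date").alias("year"),
    month("date").alias("month"),
    dayofmonth("date").alias("day")
).show()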
You can use a simple map on the underlying RDD, as with any other RDD (on Spark 2.x+ you have to go through .rdd, and the Python 2 tuple-unpacking lambda from the original answer no longer works in Python 3):
import datetime
from pyspark.sql import Row

elevDF = sqlContext.createDataFrame(sc.parallelize([
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]))

(elevDF.rdd
 .map(lambda row: (row.date.year, row.date.month, row.date.day))
 .collect())
and the result is:
[(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)]
Btw: datetime.datetime stores the hour anyway, so keeping it in a separate column seems to be a waste of memory.
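A small sketch of that point, assuming the timestamp actually carried the hour — the separate column could then be derived on demand:

from pyspark.sql.functions import hour

# derive the hour directly from the timestamp column
elevDF.select(hour("date").alias("hour_from_ts")).show(5)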
You can use the functions in pyspark.sql.functions, such as year, month, etc.; refer to the docs here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
from pyspark.sql.functions import *

newdf = elevDF.select(
    year(elevDF.date).alias('dt_year'),
    month(elevDF.date).alias('dt_month'),
    dayofmonth(elevDF.date).alias('dt_day'),
    dayofyear(elevDF.date).alias('dt_dayofy'),
    hour(elevDF.date).alias('dt_hour'),
    minute(elevDF.date).alias('dt_min'),
    weekofyear(elevDF.date).alias('dt_week_no'),
    unix_timestamp(elevDF.date).alias('dt_int'))

newdf.show()

+-------+--------+------+---------+-------+------+----------+----------+
|dt_year|dt_month|dt_day|dt_dayofy|dt_hour|dt_min|dt_week_no|    dt_int|
+-------+--------+------+---------+-------+------+----------+----------+
|   2015|       9|     6|      249|      0|     0|        36|1441497601|
|   2015|       9|     6|      249|      0|     0|        36|1441497601|
|   2015|       9|     6|      249|      0|     0|        36|1441497603|
|   2015|       9|     6|      249|      0|     1|        36|1441497694|
|   2015|       9|     6|      249|      0|    20|        36|1441498808|
|   2015|       9|     6|      249|      0|    20|        36|1441498811|
|   2015|       9|     6|      249|      0|    20|        36|1441498815|
+-------+--------+------+---------+-------+------+----------+----------+