I have a Spark DataFrame whose top rows from take(5) are as follows:
[Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
 Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]
Its schema is defined as:
elevDF.printSchema()

root
 |-- date: timestamp (nullable = true)
 |-- hour: long (nullable = true)
 |-- value: double (nullable = true)
How do I get the Year, Month, Day values from the 'date' field?
To get the month, year, and quarter of a date in PySpark, use the month(), year(), and quarter() functions respectively. year() takes a column (or column name) as its argument and extracts the year from the date; month() does the same for the month.
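For example, a minimal sketch using the elevDF from the question (year(), month(), and quarter() are all available since Spark 1.5):

from pyspark.sql.functions import year, month, quarter

# extract calendar parts from the timestamp column "date"
elevDF.select(
    year("date").alias("year"),
    month("date").alias("month"),
    quarter("date").alias("quarter")   # 1 through 4
).show(5)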
If you are loading the data from a text source such as CSV, option("dateFormat", "MM/dd/yyyy") tells the reader how to parse the date field; combined with mode "DROPMALFORMED" it will discard the rows whose dates do not match.
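A hedged sketch of that (the file path, header option, and schema are assumptions, not from the question):

# assumption: data lives in a CSV file with a header row (path is hypothetical)
df = (spark.read
      .option("header", "true")
      .option("dateFormat", "MM/dd/yyyy")   # expected format of the date field
      .option("mode", "DROPMALFORMED")      # drop rows that fail to parse
      .schema("date DATE, hour BIGINT, value DOUBLE")
      .csv("/path/to/elevation.csv"))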
You can also extract the day of the year with date_format(): pass the column name and the pattern "D" (upper case D), and store the result in a column such as "D_O_Y".
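A minimal sketch, again assuming the elevDF from the question (note that date_format() returns a string, while dayofyear() returns an integer):

from pyspark.sql.functions import date_format

# "D" is the day-of-year pattern letter
elevDF.withColumn("D_O_Y", date_format("date", "D")).show(5)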
Since Spark 1.5 you can use a number of date processing functions:
pyspark.sql.functions.year
pyspark.sql.functions.month
pyspark.sql.functions.dayofmonth
pyspark.sql.functions.dayofweek (added later, in Spark 2.3)
pyspark.sql.functions.dayofyear
pyspark.sql.functions.weekofyear
import datetime
from pyspark.sql.functions import year, month, dayofmonth

elevDF = sc.parallelize([
    (datetime.datetime(1984, 1, 1, 0, 0), 1, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 2, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 3, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 4, 638.55),
    (datetime.datetime(1984, 1, 1, 0, 0), 5, 638.55)
]).toDF(["date", "hour", "value"])

elevDF.select(
    year("date").alias('year'),
    month("date").alias('month'),
    dayofmonth("date").alias('day')
).show()

# +----+-----+---+
# |year|month|day|
# +----+-----+---+
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# |1984|    1|  1|
# +----+-----+---+
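On Spark 2.x or later you can build the same example through a SparkSession instead of sc.parallelize — a minimal sketch, assuming `spark` is an active SparkSession:

import datetime
from pyspark.sql.functions import year, month, dayofmonth

# same five rows as above, built via the SparkSession API
elevDF = spark.createDataFrame(
    [(datetime.datetime(1984, 1, 1, 0, 0), h, 638.55) for h in range(1, 6)],
    ["date", "hour", "value"])

elevDF.select(
    year("date").alias("year"),
    month("date").alias("month"),
    dayofmonth("date").alias("day")
).show()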
You can use a simple map on the underlying RDD, as with any other RDD (on Spark 2.x+ you have to go through .rdd, and the Python 2 tuple-unpacking lambda from the original answer no longer works in Python 3):
import datetime
from pyspark.sql import Row

elevDF = sqlContext.createDataFrame(sc.parallelize([
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=3, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=4, value=638.55),
    Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=5, value=638.55)]))

(elevDF.rdd
 .map(lambda row: (row.date.year, row.date.month, row.date.day))
 .collect())
and the result is:
[(1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1), (1984, 1, 1)]
Btw: datetime.datetime stores the hour anyway, so keeping it in a separate column seems to be a waste of memory.
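A small sketch of that point, assuming the timestamp actually carried the hour — the separate column could then be derived on demand:

from pyspark.sql.functions import hour

# derive the hour directly from the timestamp column
elevDF.select(hour("date").alias("hour_from_ts")).show(5)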
You can use the functions in pyspark.sql.functions, such as year, month, etc.; refer to the docs here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
from pyspark.sql.functions import *

newdf = elevDF.select(
    year(elevDF.date).alias('dt_year'),
    month(elevDF.date).alias('dt_month'),
    dayofmonth(elevDF.date).alias('dt_day'),
    dayofyear(elevDF.date).alias('dt_dayofy'),
    hour(elevDF.date).alias('dt_hour'),
    minute(elevDF.date).alias('dt_min'),
    weekofyear(elevDF.date).alias('dt_week_no'),
    unix_timestamp(elevDF.date).alias('dt_int'))

newdf.show()

+-------+--------+------+---------+-------+------+----------+----------+
|dt_year|dt_month|dt_day|dt_dayofy|dt_hour|dt_min|dt_week_no|    dt_int|
+-------+--------+------+---------+-------+------+----------+----------+
|   2015|       9|     6|      249|      0|     0|        36|1441497601|
|   2015|       9|     6|      249|      0|     0|        36|1441497601|
|   2015|       9|     6|      249|      0|     0|        36|1441497603|
|   2015|       9|     6|      249|      0|     1|        36|1441497694|
|   2015|       9|     6|      249|      0|    20|        36|1441498808|
|   2015|       9|     6|      249|      0|    20|        36|1441498811|
|   2015|       9|     6|      249|      0|    20|        36|1441498815|
+-------+--------+------+---------+-------+------+----------+----------+