PySpark 1.5 How to Truncate Timestamp to Nearest Minute from seconds

I am using PySpark. I have a column ('dt') in a DataFrame ('canon_evt') that is a timestamp. I am trying to remove the seconds from a DateTime value. It is originally read in from parquet as a string, and I then try to convert it to a Timestamp via

from pyspark.sql.functions import to_date

canon_evt = canon_evt.withColumn('dt', to_date(canon_evt.dt))
canon_evt = canon_evt.withColumn('dt', canon_evt.dt.astype('Timestamp'))

Then I would like to remove the seconds. I tried trunc and date_format, and even tried concatenating the pieces together as below. I think it requires some sort of map and lambda combination, but I'm not certain whether Timestamp is an appropriate format, and whether it's possible to get rid of the seconds at all.

canon_evt = canon_evt.withColumn('dyt',year('dt') + '-' + month('dt') +
    '-' + dayofmonth('dt') + ' ' + hour('dt') + ':' + minute('dt'))

which produces:

[Row(dt=datetime.datetime(2015, 9, 16, 0, 0), dyt=None)]
asked Dec 11 '15 by PR102012



2 Answers

Spark >= 2.3

You can use date_trunc:

from pyspark.sql.functions import col, date_trunc

df.withColumn("dt_truncated", date_trunc("minute", col("dt"))).show()

## +-------------------+-------------------+
## |                 dt|       dt_truncated|
## +-------------------+-------------------+
## |1970-01-01 00:00:00|1970-01-01 00:00:00|
## |2015-09-16 05:39:46|2015-09-16 05:39:00|
## |2015-09-16 05:40:46|2015-09-16 05:40:00|
## |2016-03-05 02:00:10|2016-03-05 02:00:00|
## +-------------------+-------------------+
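
date_trunc is not limited to minutes; it also accepts coarser units such as "hour", "day", "month" and "year", so the same pattern covers other truncations. A minimal sketch, using the same df as above:

from pyspark.sql.functions import col, date_trunc

## truncate the same column to the start of the hour and to the start of the day
df.withColumn("dt_hour", date_trunc("hour", col("dt"))) \
  .withColumn("dt_day", date_trunc("day", col("dt"))) \
  .show()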

Spark < 2.3

Converting to Unix timestamps and basic arithmetic should do the trick:

from pyspark.sql import Row
from pyspark.sql.functions import col, unix_timestamp, round

df = sc.parallelize([
    Row(dt='1970-01-01 00:00:00'),
    Row(dt='2015-09-16 05:39:46'),
    Row(dt='2015-09-16 05:40:46'),
    Row(dt='2016-03-05 02:00:10'),
]).toDF()


## unix_timestamp converts string to Unix timestamp (bigint / long)
## in seconds. Divide by 60, round, multiply by 60 and cast
## should work just fine.
## 
dt_truncated = ((round(unix_timestamp(col("dt")) / 60) * 60)
    .cast("timestamp"))

df.withColumn("dt_truncated", dt_truncated).show(10, False)
## +-------------------+---------------------+
## |dt                 |dt_truncated         |
## +-------------------+---------------------+
## |1970-01-01 00:00:00|1970-01-01 00:00:00.0|
## |2015-09-16 05:39:46|2015-09-16 05:40:00.0|
## |2015-09-16 05:40:46|2015-09-16 05:41:00.0|
## |2016-03-05 02:00:10|2016-03-05 02:00:00.0|
## +-------------------+---------------------+
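
Note that round in the snippet above rounds to the nearest minute, which is why 05:39:46 becomes 05:40:00 in the output, whereas date_trunc always floors. If you want true truncation on Spark < 2.3, a minimal sketch swapping round for floor:

from pyspark.sql.functions import col, floor, unix_timestamp

## floor drops the seconds instead of rounding them,
## matching the behaviour of date_trunc("minute", ...)
dt_floored = (floor(unix_timestamp(col("dt")) / 60) * 60).cast("timestamp")

df.withColumn("dt_truncated", dt_floored).show(10, False)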
answered Sep 24 '22 by zero323


This question was asked a few years ago, but if anyone else comes across it: as of Spark 2.3 this has been added as a built-in function. It is now as simple as the following (assuming canon_evt is a DataFrame with a timestamp column dt from which we want to remove the seconds):

from pyspark.sql.functions import date_trunc

canon_evt = canon_evt.withColumn('dt', date_trunc('minute', canon_evt.dt))
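
In the original question dt is read from parquet as a string, so it may need to be converted first. A minimal sketch, assuming Spark >= 2.2 and that the strings look like 'yyyy-MM-dd HH:mm:ss' (the format string is an assumption):

from pyspark.sql.functions import to_timestamp, date_trunc

## convert the string column to a proper timestamp first, then drop the seconds
canon_evt = canon_evt.withColumn('dt', to_timestamp(canon_evt.dt, 'yyyy-MM-dd HH:mm:ss'))
canon_evt = canon_evt.withColumn('dt', date_trunc('minute', canon_evt.dt))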
answered Sep 25 '22 by Blake Larkin