I have a DataFrame in PySpark. In this DataFrame I have a column of timestamp data type. Now I want to add 2 hours to every row of that timestamp column, without creating any new columns.
For Example: This is sample data
df
id testing_time test_name
1 2017-03-12 03:19:58 Raising
2 2017-03-12 03:21:30 sleeping
3 2017-03-12 03:29:40 walking
4 2017-03-12 03:31:23 talking
5 2017-03-12 04:19:47 eating
6 2017-03-12 04:33:51 working
I want to have something like below.
df1
id testing_time test_name
1 2017-03-12 05:19:58 Raising
2 2017-03-12 05:21:30 sleeping
3 2017-03-12 05:29:40 walking
4 2017-03-12 05:31:23 talking
5 2017-03-12 06:19:47 eating
6 2017-03-12 06:33:51 working
How can I do that?
One approach that doesn't require explicit casting is to use Spark interval literals (with arguable readability advantages):
import pyspark.sql.functions as F

df = df.withColumn('testing_time', df.testing_time + F.expr('INTERVAL 2 HOURS'))
df.show()
+---+-------------------+---------+
| id| testing_time|test_name|
+---+-------------------+---------+
| 1|2017-03-12 05:19:58| Raising|
| 2|2017-03-12 05:21:30| sleeping|
| 3|2017-03-12 05:29:40| walking|
| 4|2017-03-12 05:31:23| talking|
| 5|2017-03-12 06:19:47| eating|
| 6|2017-03-12 06:33:51| working|
+---+-------------------+---------+
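As an aside, the interval literal accepts other units as well; for example, a sketch subtracting 30 minutes instead (assuming the same df and the F import above):

df = df.withColumn('testing_time', df.testing_time - F.expr('INTERVAL 30 MINUTES'))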
Or, in full:
import pyspark.sql.functions as F
from datetime import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (1, datetime(2017, 3, 12, 3, 19, 58), 'Raising'),
    (2, datetime(2017, 3, 12, 3, 21, 30), 'sleeping'),
    (3, datetime(2017, 3, 12, 3, 29, 40), 'walking'),
    (4, datetime(2017, 3, 12, 3, 31, 23), 'talking'),
    (5, datetime(2017, 3, 12, 4, 19, 47), 'eating'),
    (6, datetime(2017, 3, 12, 4, 33, 51), 'working'),
]
df = spark.createDataFrame(data, ['id', 'testing_time', 'test_name'])

# Shift the existing column in place by adding an interval literal to it.
df = df.withColumn('testing_time', df.testing_time + F.expr('INTERVAL 2 HOURS'))
df.show()
+---+-------------------+---------+
| id| testing_time|test_name|
+---+-------------------+---------+
| 1|2017-03-12 05:19:58| Raising|
| 2|2017-03-12 05:21:30| sleeping|
| 3|2017-03-12 05:29:40| walking|
| 4|2017-03-12 05:31:23| talking|
| 5|2017-03-12 06:19:47| eating|
| 6|2017-03-12 06:33:51| working|
+---+-------------------+---------+
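If you prefer to keep the whole expression in SQL form, the same arithmetic can also be written through selectExpr. This is just a sketch of an equivalent formulation, assuming the original (unshifted) df from the listing above:

# Same shift expressed entirely as SQL expressions inside selectExpr
df1 = df.selectExpr(
    'id',
    'testing_time + INTERVAL 2 HOURS AS testing_time',
    'test_name',
)
df1.show()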
You can convert the testing_time column to seconds (as a bigint) using the unix_timestamp function, add 2 hours (7200 seconds), and then cast the result back to timestamp:
import pyspark.sql.functions as F
df.withColumn("testing_time", (F.unix_timestamp("testing_time") + 7200).cast('timestamp')).show()
+---+-------------------+---------+
| id| testing_time|test_name|
+---+-------------------+---------+
| 1|2017-03-12 05:19:58| Raising|
| 2|2017-03-12 05:21:30| sleeping|
| 3|2017-03-12 05:29:40| walking|
| 4|2017-03-12 05:31:23| talking|
| 5|2017-03-12 06:19:47| eating|
| 6|2017-03-12 06:33:51| working|
+---+-------------------+---------+
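If the shift needs to be configurable, the same unix_timestamp approach can take the offset from a plain Python variable; a minimal sketch (hours_to_add is a name introduced here just for illustration):

# Convert to seconds, add the configurable offset, cast back to timestamp
hours_to_add = 2
df.withColumn(
    "testing_time",
    (F.unix_timestamp("testing_time") + hours_to_add * 3600).cast("timestamp"),
).show()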