How do you do a roundtrip conversion of timestamp data from Spark Python to Pandas and back? I read data from a Hive table in Spark, want to do some calculations in Pandas, and write the results back to Hive. Only the last step is failing: converting a Pandas timestamp back to a Spark DataFrame timestamp.
import datetime
import pandas as pd
dates = [
('today', '2017-03-03 11:30:00')
, ('tomorrow', '2017-03-04 08:00:00')
, ('next Thursday', '2017-03-09 20:00:00')
]
string_date_rdd = sc.parallelize(dates)
timestamp_date_rdd = string_date_rdd.map(lambda t: (t[0], datetime.datetime.strptime(t[1], '%Y-%m-%d %H:%M:%S')))
timestamp_df = sqlContext.createDataFrame(timestamp_date_rdd, ['Day', 'Date'])
timestamp_pandas_df = timestamp_df.toPandas()
roundtrip_df = sqlContext.createDataFrame(timestamp_pandas_df)
roundtrip_df.printSchema()
roundtrip_df.show()
root
|-- Day: string (nullable = true)
|-- Date: long (nullable = true)
+-------------+-------------------+
| Day| Date|
+-------------+-------------------+
| today|1488540600000000000|
| tomorrow|1488614400000000000|
|next Thursday|1489089600000000000|
+-------------+-------------------+
At this point the roundtrip Spark DataFrame has the Date column as data type long. In PySpark this can be converted back to a datetime object easily, e.g., datetime.datetime.fromtimestamp(1489089600000000000 / 1000000000), although the time of day is off by a few hours. How do I do this conversion on the Spark DataFrame itself, i.e., change the column's data type back to a timestamp?
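For example, a quick check in plain Python with the last value from the table above:

import datetime

# Divide the nanosecond epoch value by 1e9 to get seconds, then convert.
datetime.datetime.fromtimestamp(1489089600000000000 / 1000000000)
# -> a datetime on 2017-03-09, but the hour is shifted by the local UTC offset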
Python 3.4.5, Spark 1.6.0
Thanks, John
Convert PySpark DataFrame to Pandas DataFrame
PySpark DataFrame provides a toPandas() method to convert it to a Python Pandas DataFrame. toPandas() collects all records of the PySpark DataFrame to the driver program, so it should only be done on a small subset of the data.
Looking at the source code for toPandas(), one reason it may be slow is that it first creates the pandas DataFrame and then copies each of the Series in that DataFrame over to the returned DataFrame.
If you are working on a machine learning application dealing with larger datasets, PySpark is a better fit, as it can process operations many times (up to 100x) faster than Pandas.
If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the SQL module with the command pip install pyspark[sql]. Otherwise, you must ensure that PyArrow is installed and available on all cluster nodes. You can install it using pip or conda from the conda-forge channel.
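For reference, on newer Spark releases (Arrow-backed conversion was added in Spark 2.3, so it does not apply to the Spark 1.6.0 setup in the question) the Arrow path is switched on with a single configuration property. A minimal sketch, assuming PyArrow is installed:

# Sketch for Spark 2.3+ with PyArrow available; not applicable to Spark 1.6.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-roundtrip").getOrCreate()

# Enable Arrow-based columnar transfers between the JVM and Python.
# (On Spark 3.x the property is spark.sql.execution.arrow.pyspark.enabled.)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = timestamp_df.toPandas()              # uses Arrow when the flag is set
roundtrip_df = spark.createDataFrame(pdf)  # datetime64[ns] maps back to a Spark timestamp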
Here's one solution I found:
from pyspark.sql.types import TimestampType
extra_column_df = roundtrip_df.select(roundtrip_df.Day, roundtrip_df.Date).withColumn('new_date', roundtrip_df.Date / 1000000000)
roundtrip_timestamp_df = extra_column_df.select(extra_column_df.Day, extra_column_df.new_date.cast(TimestampType()).alias('Date'))
Outputs:
root
|-- Day: string (nullable = true)
|-- Date: timestamp (nullable = true)
+-------------+--------------------+
|          Day|                Date|
+-------------+--------------------+
| today|2017-03-03 11:30:...|
| tomorrow|2017-03-04 08:00:...|
|next Thursday|2017-03-09 20:00:...|
+-------------+--------------------+
As an additional bug (or feature), this seems to convert all the dates to UTC, with DST taken into account.
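Another way to side-step the long column entirely is to rebuild the rows with plain Python datetime objects before calling createDataFrame, so Spark infers TimestampType directly. A rough sketch using the variable names from the question (not tested on Spark 1.6):

# timestamp_pandas_df is the pandas DataFrame produced by toPandas() above.
rows = [
    (day, ts.to_pydatetime())  # pandas.Timestamp -> datetime.datetime
    for day, ts in zip(timestamp_pandas_df['Day'], timestamp_pandas_df['Date'])
]
roundtrip_df2 = sqlContext.createDataFrame(rows, ['Day', 'Date'])
roundtrip_df2.printSchema()    # Date comes back as timestamp rather than long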