
PySpark: creating a timestamp column

I am using Spark 2.1.0. I am not able to create a timestamp column in PySpark with the code snippet below. Please help.

df=df.withColumn('Age',lit(datetime.now()))

I am getting:

AssertionError: col should be Column

Naveen Srikanth asked Aug 02 '17 19:08




2 Answers

I am not sure about 2.1.0, but on 2.2.1 at least you can simply do:

from pyspark.sql import functions as F
# withColumn returns a new DataFrame, so keep the result
df = df.withColumn('Age', F.current_timestamp())

Hope it helps!

balalaika answered Sep 20 '22 06:09


I am assuming you have the DataFrame from your code snippet and want the same timestamp in every row.

Let me create a dummy DataFrame.

>>> rows = [{'name': 'Alice', 'age': 1}, {'name': 'Again', 'age': 2}]
>>> df = spark.createDataFrame(rows)

>>> import datetime
>>> timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
>>> type(timestamp)
<class 'str'>

>>> from pyspark.sql.functions import lit,unix_timestamp
>>> timestamp
'2017-08-02 16:16:14'
>>> new_df = df.withColumn('time',unix_timestamp(lit(timestamp),'yyyy-MM-dd HH:mm:ss').cast("timestamp"))
>>> new_df.show(truncate = False)
+---+-----+---------------------+
|age|name |time                 |
+---+-----+---------------------+
|1  |Alice|2017-08-02 16:16:14.0|
|2  |Again|2017-08-02 16:16:14.0|
+---+-----+---------------------+

>>> new_df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
 |-- time: timestamp (nullable = true)
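
The steps above can be condensed into a small helper. This is only a minimal sketch: `spark_timestamp_string` and `add_constant_timestamp` are hypothetical names, and pyspark is imported inside the second function (an assumption, made so the formatting part can be checked without a Spark installation).

```python
import datetime

# Python strftime pattern and the matching Spark pattern used above.
PY_FORMAT = "%Y-%m-%d %H:%M:%S"
SPARK_FORMAT = "yyyy-MM-dd HH:mm:ss"


def spark_timestamp_string(moment=None):
    """Format a datetime (default: now) as 'YYYY-MM-DD HH:MM:SS'."""
    if moment is None:
        moment = datetime.datetime.now()
    return moment.strftime(PY_FORMAT)


def add_constant_timestamp(df, column_name="time", moment=None):
    """Add the same timestamp to every row of df (hypothetical helper)."""
    # Lazy import: an assumption for this sketch, not a requirement.
    from pyspark.sql.functions import lit, unix_timestamp

    text = spark_timestamp_string(moment)
    # unix_timestamp parses the string into epoch seconds; the cast
    # turns those seconds into a timestamp column.
    return df.withColumn(
        column_name,
        unix_timestamp(lit(text), SPARK_FORMAT).cast("timestamp"),
    )
```

With the DataFrame from this answer, `new_df = add_constant_timestamp(df)` should give the same kind of `time` column as shown above. Note that the format keeps whole seconds only, so any sub-second part of the Python datetime is dropped.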

Ankush Singh answered Sep 22 '22 06:09