
Populate a pyspark dataframe with DATE sample data

I am trying to create and populate a PySpark DataFrame with date values.

Columns = ["EmployeeNo", "Name", "EmployeeID", "ValidFrom", "ValidTo"]
Data = [(100, "Hilmar Buchta", "HB", "2000-01-01", "2999-12-31"),
       ]

DfEmployee = spark.createDataFrame(Data, Columns)
DfEmployee.show()

gives

+----------+----------------+----------+----------+----------+
|EmployeeNo|            Name|EmployeeID| ValidFrom|   ValidTo|
+----------+----------------+----------+----------+----------+
|       100|   Hilmar Buchta|        HB|2000-01-01|2999-12-31|
+----------+----------------+----------+----------+----------+

It looks correct, but the ValidFrom and ValidTo values are strings, not dates. How can I populate a df column with values of type date in one step?

I've been searching StackOverflow for a while and tried this:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType, BooleanType

Schema = StructType([
                      StructField('EmployeeNo', IntegerType(), False),
                      StructField('Name', StringType(), False),
                      StructField('EmployeeID', StringType(), False),
                      StructField('ValidFrom', DateType(), False),
                      StructField('ValidTo', DateType(), False)
                  ])

Data = [(100, "Hilmar Buchta", "HB", "2000-01-01", "2999-12-31"),]
DfEmployee = spark.createDataFrame(Data, Schema)

gives

TypeError: field ValidFrom: DateType can not accept object '2000-01-01' in type <class 'str'>

So I tried

Data = [(100, "Hilmar Buchta", "HB", F.to_date("2000-01-01", "yyyy-MM-dd"), F.to_date("2999-12-31", "yyyy-MM-dd")),]
DfEmployee = spark.createDataFrame(Data, Schema)

gives

TypeError: field ValidFrom: DateType can not accept object Column<b"to_date(2000-01-01, 'yyyy-MM-dd')"> in type <class 'pyspark.sql.column.Column'>

Wim asked Nov 07 '25 08:11


1 Answer

You can pass Python datetime.date objects instead of strings:

import datetime

Data = [
    (100, "Hilmar Buchta", "HB", datetime.date(2000, 1, 1), datetime.date(2999, 12, 31)),
]

DfEmployee = spark.createDataFrame(Data, Columns)

DfEmployee.printSchema()

#root
# |-- EmployeeNo: long (nullable = true)
# |-- Name: string (nullable = true)
# |-- EmployeeID: string (nullable = true)
# |-- ValidFrom: date (nullable = true)
# |-- ValidTo: date (nullable = true)

Or convert the strings into Python date objects:

from datetime import datetime

Data = [
    # %m is the month; %M would be parsed as minutes and yield a wrong date
    (100, "Hilmar Buchta", "HB", datetime.strptime("2000-01-01", "%Y-%m-%d").date(),
     datetime.strptime("2999-12-31", "%Y-%m-%d").date()
     ),
]
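
If the sample rows already exist as ISO-formatted strings, the conversion can be applied to the whole list before it is handed to Spark. The following is only a minimal sketch: with_dates and RawData are hypothetical names that do not appear in the question, and it assumes every date string is in yyyy-mm-dd form so that the standard library's date.fromisoformat can parse it.

```python
from datetime import date


def with_dates(rows, date_positions):
    """Hypothetical helper: return copies of the row tuples in which the
    values at the given positions are parsed from ISO strings into
    datetime.date objects. All other values are left unchanged."""
    converted = []
    for row in rows:
        converted.append(tuple(
            date.fromisoformat(value) if i in date_positions else value
            for i, value in enumerate(row)
        ))
    return converted


# Same sample row as in the question, still holding the dates as strings.
RawData = [(100, "Hilmar Buchta", "HB", "2000-01-01", "2999-12-31")]

# Positions 3 and 4 are ValidFrom and ValidTo.
Data = with_dates(RawData, date_positions={3, 4})
print(Data)
```

The resulting Data list holds datetime.date values in the ValidFrom and ValidTo positions, so it should be usable with spark.createDataFrame in the same way as the hand-written rows in this answer.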

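F.to_date returns a Column expression, so it only works inside DataFrame operations and not on the plain Python values passed to createDataFrame. Alternatively, you can create the DataFrame with string columns and convert them to dates afterwards, for example: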

df = df.withColumn("ValidFrom", F.to_date("ValidFrom", "yyyy-MM-dd"))
blackbishop answered Nov 10 '25 14:11


