I am trying to create and populate a PySpark DataFrame with date values.
Columns = ["EmployeeNo", "Name", "EmployeeID", "ValidFrom", "ValidTo"]
Data = [(100, "Hilmar Buchta", "HB", "2000-01-01", "2999-12-31"),
]
DfEmployee = spark.createDataFrame(Data, Columns)
DfEmployee.show()
gives
+----------+----------------+----------+----------+----------+
|EmployeeNo|            Name|EmployeeID| ValidFrom|   ValidTo|
+----------+----------------+----------+----------+----------+
|       100|   Hilmar Buchta|        HB|2000-01-01|2999-12-31|
+----------+----------------+----------+----------+----------+
It looks correct, but the ValidFrom and ValidTo values are strings, not dates. How can I populate a df column with values of type date in one step?
I've been searching StackOverflow for a while and tried this:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType, BooleanType
Schema = StructType([
    StructField('EmployeeNo', IntegerType(), False),
    StructField('Name', StringType(), False),
    StructField('EmployeeID', StringType(), False),
    StructField('ValidFrom', DateType(), False),
    StructField('ValidTo', DateType(), False)
])
Data = [(100, "Hilmar Buchta", "HB", "2000-01-01", "2999-12-31"),]
DfEmployee = spark.createDataFrame(Data, Schema)
gives
TypeError: field ValidFrom: DateType can not accept object '2000-01-01' in type <class 'str'>
So I tried
Data = [(100, "Hilmar Buchta", "HB", F.to_date("2000-01-01", "yyyy-MM-dd"), F.to_date("2999-12-31", "yyyy-MM-dd")),]
DfEmployee = spark.createDataFrame(Data, Schema)
gives
TypeError: field ValidFrom: DateType can not accept object Column<b"to_date(2000-01-01, 'yyyy-MM-dd')"> in type <class 'pyspark.sql.column.Column'>
You can pass Python datetime.date objects instead of strings:
import datetime
Data = [
(100, "Hilmar Buchta", "HB", datetime.date(2000, 1, 1), datetime.date(2999, 12, 31)),
]
DfEmployee = spark.createDataFrame(Data, Columns)
DfEmployee.printSchema()
#root
# |-- EmployeeNo: long (nullable = true)
# |-- Name: string (nullable = true)
# |-- EmployeeID: string (nullable = true)
# |-- ValidFrom: date (nullable = true)
# |-- ValidTo: date (nullable = true)
Or convert the strings into Python date objects:
from datetime import datetime
Data = [
    (100, "Hilmar Buchta", "HB",
     # %m is the month; %M would parse minutes and silently leave the month at January
     datetime.strptime("2000-01-01", "%Y-%m-%d").date(),
     datetime.strptime("2999-12-31", "%Y-%m-%d").date()),
]
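If there are many rows, the string-to-date conversion can be applied to the whole list before it is handed to Spark. The following is only a minimal sketch: the helper name parse_date_fields and its date_positions parameter are made up for illustration, and it assumes every date string is in ISO yyyy-MM-dd form (it uses date.fromisoformat, available from Python 3.7).

```python
from datetime import date


def parse_date_fields(rows, date_positions):
    """Return copies of the rows with the ISO date strings at the given
    tuple positions replaced by datetime.date objects.

    Hypothetical helper for illustration; not part of PySpark.
    """
    converted = []
    for row in rows:
        converted.append(tuple(
            date.fromisoformat(value) if index in date_positions else value
            for index, value in enumerate(row)
        ))
    return converted


raw_data = [(100, "Hilmar Buchta", "HB", "2000-01-01", "2999-12-31")]

# Positions 3 and 4 hold ValidFrom and ValidTo
Data = parse_date_fields(raw_data, date_positions={3, 4})
print(Data)
# [(100, 'Hilmar Buchta', 'HB', datetime.date(2000, 1, 1), datetime.date(2999, 12, 31))]
```

The resulting Data list should then be usable with spark.createDataFrame(Data, Columns) in the same way as the hand-written datetime.date values above.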
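F.to_date builds a Column expression, so it only works on the columns of an existing DataFrame, not on plain Python values. Alternatively, create the DataFrame with string columns first and convert them to dates afterwards, for example:
DfEmployee = DfEmployee.withColumn("ValidFrom", F.to_date("ValidFrom", "yyyy-MM-dd"))
DfEmployee = DfEmployee.withColumn("ValidTo", F.to_date("ValidTo", "yyyy-MM-dd"))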