 

PySpark cannot infer timestamp even with timestampFormat

I have this json file

{"created_at":"2022-01-02 12:17:43.399 UTC","updated_at":"2022-01-02 12:17:43.399 UTC"}

Trying to read it as

read_df = spark \
            .read \
            .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS 'UTC'") \
            .option("inferSchema", "true") \
            .json(path)

but the inferred schema gives me back

root
 |-- created_at: string (nullable = true)
 |-- updated_at: string (nullable = true)

I've tried to force it via withColumn("timestamp", to_timestamp(col("created_at"), "yyyy-MM-dd HH:mm:ss.SSS 'UTC'")) and it works.
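As an aside, the pattern itself is fine: Spark uses Java DateTimeFormatter syntax, where SSS is the fractional seconds and 'UTC' is a quoted literal. A rough pure-Python equivalent (using strptime, where %f plays the role of SSS and "UTC" is matched verbatim) parses the sample value without trouble; this is just an illustration of what the pattern means, not how Spark parses it internally.

```python
from datetime import datetime

# Spark pattern "yyyy-MM-dd HH:mm:ss.SSS 'UTC'" (Java DateTimeFormatter syntax)
# translated to Python strptime codes: %f covers the fractional seconds,
# and the trailing "UTC" is matched as a literal string.
raw = "2022-01-02 12:17:43.399 UTC"
dt = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f UTC")
print(dt)  # 2022-01-02 12:17:43.399000
```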

I don't want to provide the schema myself but let Spark infer it, because I have different files with different schemas and want to re-use the reading function.

I'm not sure what's wrong.

Spark version: 3.3.2

Tizianoreica asked Sep 12 '25

1 Answer

Timestamp inference has to be enabled explicitly (docs, code):

Since version 3.0.1, the timestamp type inference is disabled by default. Set the JSON option inferTimestamp to true to enable such type inference.

read_df = spark \
            .read \
            .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS 'UTC'") \
            .option("inferSchema", "true") \
            .option("inferTimestamp", "true") \
            .json(path)

returns two timestamp columns.

werner answered Sep 13 '25