 

PySpark cannot infer timestamp even with timestampFormat

I have this json file

{"created_at":"2022-01-02 12:17:43.399 UTC","updated_at":"2022-01-02 12:17:43.399 UTC"}

Trying to read it as

read_df = spark \
            .read \
            .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS 'UTC'") \
            .option("inferSchema", "true") \
            .json(path)

but the inferred schema gives me back

root
 |-- created_at: string (nullable = true)
 |-- updated_at: string (nullable = true)

I've tried to force it via withColumn("timestamp", to_timestamp(col("created_at"), "yyyy-MM-dd HH:mm:ss.SSS 'UTC'")) and it works.
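As an aside, the pattern itself is fine: Spark uses Java DateTimeFormatter syntax, where SSS is the fractional seconds and 'UTC' is a quoted literal. A rough pure-Python equivalent (using strptime, where %f plays the role of SSS and "UTC" is matched verbatim) parses the sample value without trouble; this is just an illustration of what the pattern means, not how Spark parses it internally.

```python
from datetime import datetime

# Spark pattern "yyyy-MM-dd HH:mm:ss.SSS 'UTC'" (Java DateTimeFormatter syntax)
# translated to Python strptime codes: %f covers the fractional seconds,
# and the trailing "UTC" is matched as a literal string.
raw = "2022-01-02 12:17:43.399 UTC"
dt = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f UTC")
print(dt)  # 2022-01-02 12:17:43.399000
```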

I don't want to provide the schema myself but let Spark infer it, because I have different files with different schemas and want to re-use the reading function.

I'm not sure what's wrong.

Spark version: 3.3.2

Tizianoreica asked Sep 12 '25

1 Answer

Timestamp inference has to be enabled explicitly (docs, code):

Since version 3.0.1, the timestamp type inference is disabled by default. Set the JSON option inferTimestamp to true to enable such type inference.

read_df = spark \
            .read \
            .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS 'UTC'") \
            .option("inferSchema", "true") \
            .option("inferTimestamp", "true") \
            .json(path)

returns two timestamp columns.

werner answered Sep 13 '25