How to save spark dataframe to parquet without using INT96 format for timestamp columns?

I have a Spark dataframe that I want to save as Parquet and then load using the parquet-avro library.

There is a timestamp column in my dataframe that is converted to an INT96 timestamp column in Parquet. However, parquet-avro does not support the INT96 format and throws an exception.

Is there a way to avoid this? Is it possible to change the format Spark uses when writing timestamps to Parquet to something that parquet-avro supports?

I currently use:

data_frame.write.parquet("path")
Fabich asked Mar 03 '23 20:03


1 Answer

Reading the Spark source code, I found the spark.sql.parquet.outputTimestampType property.

spark.sql.parquet.outputTimestampType:
Sets which Parquet timestamp type to use when Spark writes data to Parquet files.
INT96 is a non-standard but commonly used timestamp type in Parquet.
TIMESTAMP_MICROS is a standard timestamp type in Parquet, which stores number of microseconds from the Unix epoch.
TIMESTAMP_MILLIS is also standard, but with millisecond precision, which means Spark has to truncate the microsecond portion of its timestamp value.
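
To make the precision difference between the two standard types concrete, here is a minimal sketch in plain Python (no Spark needed). The helper names to_micros and to_millis are mine, not Spark APIs, and I am assuming the truncation behaves like floor division of the microsecond count by 1000.

```python
from datetime import datetime, timezone

# Unix epoch as a timezone-aware datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(ts):
    """Microseconds since the Unix epoch (what TIMESTAMP_MICROS stores)."""
    delta = ts - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def to_millis(ts):
    """Milliseconds since the Unix epoch (what TIMESTAMP_MILLIS stores).

    Assumption: the sub-millisecond part is dropped by floor division.
    """
    return to_micros(ts) // 1_000


# Example timestamp with a microsecond component of 123456.
ts = datetime(2023, 3, 6, 12, 3, 0, 123456, tzinfo=timezone.utc)
micros = to_micros(ts)
millis = to_millis(ts)

# Microseconds that do not survive a round trip through milliseconds.
lost_micros = micros - millis * 1_000
print(lost_micros)  # prints 456
```

So with TIMESTAMP_MILLIS the trailing 456 microseconds of this value would be lost, while TIMESTAMP_MICROS keeps them.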

So I can do the following:

spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
data_frame.write.parquet("path")
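
If this is needed in several places, the two lines above could be wrapped in a small helper. This is only a sketch: write_parquet_with_timestamp_type and ALLOWED_TYPES are hypothetical names of mine, and the accepted values are taken from the property description quoted above, so check them against your Spark version.

```python
# Property name found in the Spark source.
OUTPUT_TIMESTAMP_TYPE_KEY = "spark.sql.parquet.outputTimestampType"

# Values listed in the property description (assumed complete).
ALLOWED_TYPES = {"INT96", "TIMESTAMP_MICROS", "TIMESTAMP_MILLIS"}


def write_parquet_with_timestamp_type(
    spark, data_frame, path, timestamp_type="TIMESTAMP_MICROS"
):
    """Set the Parquet timestamp type on the session, then write the dataframe.

    `spark` is expected to be a SparkSession and `data_frame` a Spark
    DataFrame. The setting is session-wide, so it stays in effect for
    later writes in the same session.
    """
    if timestamp_type not in ALLOWED_TYPES:
        raise ValueError(
            f"unsupported timestamp type {timestamp_type!r}; "
            f"expected one of {sorted(ALLOWED_TYPES)}"
        )
    spark.conf.set(OUTPUT_TIMESTAMP_TYPE_KEY, timestamp_type)
    data_frame.write.parquet(path)


# Usage, assuming an existing SparkSession `spark` and DataFrame `data_frame`:
# write_parquet_with_timestamp_type(spark, data_frame, "path")
```

The early check only catches typos in the type name before the write starts; it does not verify that the reader on the other side supports the chosen type.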
Fabich answered Mar 06 '23 12:03