Spark's int96 time type

Tags: datetime, apache-spark, parquet

When you create a timestamp column in Spark and save it to Parquet, you get a 12-byte integer column type (int96); I gather the data is split into 8 bytes for nanoseconds within the day and 4 bytes for the Julian day.
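For reference, the 12-byte layout can be sketched in plain Scala. This assumes the layout commonly written by Impala, Hive, and Spark: 8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian. `Int96Sketch`, `decode`, and `encode` are hypothetical names for illustration, not part of Spark or parquet-mr.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.time.Instant

// Hypothetical helper for illustration only; not a Spark or parquet-mr API.
object Int96Sketch {
  // Julian day number of 1970-01-01, the Unix epoch.
  val JulianDayOfEpoch: Long = 2440588L
  val SecondsPerDay: Long = 86400L
  val NanosPerSecond: Long = 1000000000L

  // Assumed layout: 8 bytes nanos-of-day, then 4 bytes Julian day, little-endian.
  def decode(bytes: Array[Byte]): Instant = {
    require(bytes.length == 12, "an INT96 value is exactly 12 bytes")
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val nanosOfDay = buf.getLong()       // first 8 bytes
    val julianDay = buf.getInt().toLong  // last 4 bytes
    val epochSeconds = (julianDay - JulianDayOfEpoch) * SecondsPerDay
    Instant.ofEpochSecond(epochSeconds, nanosOfDay)
  }

  // Inverse of decode, written under the same layout assumption.
  def encode(instant: Instant): Array[Byte] = {
    val secs = instant.getEpochSecond
    val daysSinceEpoch = Math.floorDiv(secs, SecondsPerDay)
    val nanosOfDay =
      Math.floorMod(secs, SecondsPerDay) * NanosPerSecond + instant.getNano
    ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
      .putLong(nanosOfDay)
      .putInt((daysSinceEpoch + JulianDayOfEpoch).toInt)
      .array()
  }
}
```

A round trip such as `Int96Sketch.decode(Int96Sketch.encode(Instant.parse("2017-03-06T10:00:00Z")))` should return the original instant. Nothing in the bytes themselves marks them as a timestamp.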

This does not correspond to any Parquet logical type, so the schema in the Parquet file gives no indication that the column is anything other than an integer.

My question is, how does Spark know to load such a column as a timestamp as opposed to a big integer?

345

asked Mar 06 '17 14:03

mdurant

1 Answer

The semantics are determined from the metadata. We'll need some imports:

import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration

example data:

val path = "/tmp/ts"

Seq((1, "2017-03-06 10:00:00")).toDF("id", "ts")
  .withColumn("ts", $"ts".cast("timestamp"))
  .write.mode("overwrite").parquet(path)

and Hadoop configuration:

val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)

Now we can access Spark metadata:

ParquetFileReader
  .readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
  .get(0)
  .getParquetMetadata
  .getFileMetaData
  .getKeyValueMetaData
  .get("org.apache.spark.sql.parquet.row.metadata")

and the result is:

String = {"type":"struct","fields":[
  {"name":"id","type":"integer","nullable":false,"metadata":{}},
  {"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
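To illustrate how such a JSON string maps back to column types, here is a minimal sketch that parses it with the public `DataType.fromJson` method. It assumes spark-sql is on the classpath; the object name `SchemaFromMetadata` is made up for the example, and this is not necessarily the exact code path Spark follows internally when it reads the footer.

```scala
import org.apache.spark.sql.types.{DataType, StructType, TimestampType}

// Hypothetical wrapper object for illustration.
object SchemaFromMetadata {
  // The JSON stored under "org.apache.spark.sql.parquet.row.metadata".
  val json: String =
    """{"type":"struct","fields":[
      |{"name":"id","type":"integer","nullable":false,"metadata":{}},
      |{"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}""".stripMargin

  // Rebuild the Spark SQL schema from the JSON string.
  val schema: StructType = DataType.fromJson(json).asInstanceOf[StructType]

  def main(args: Array[String]): Unit = {
    // Print each field name with its Spark SQL type.
    schema.fields.foreach(f => println(s"${f.name}: ${f.dataType.simpleString}"))
    // The "ts" field comes back as a timestamp, not as a 12-byte integer.
    println(schema("ts").dataType == TimestampType)
  }
}
```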

Equivalent information can be stored in the Metastore as well.

According to the official documentation, interpreting INT96 as a timestamp is there for compatibility with Hive and Impala:

Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.

and can be controlled using the spark.sql.parquet.int96AsTimestamp property.
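As a rough sketch, the property can be set when the session is built. The application name and the local master below are assumptions for a standalone run, and the path is the example path written earlier; in spark-shell, the existing `spark` session and `spark.conf.set` would serve the same purpose.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone entry point for illustration.
object Int96Flag {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("int96-flag-sketch") // assumed name
      .master("local[*]")           // assumed local run
      // Ask Spark SQL to interpret INT96 columns as timestamps.
      .config("spark.sql.parquet.int96AsTimestamp", "true")
      .getOrCreate()

    // Read the example data back and show the schema Spark inferred.
    spark.read.parquet("/tmp/ts").printSchema()
    spark.stop()
  }
}
```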

146

answered Oct 20 '22 18:10

zero323
