
Transfer and write Parquet with Python and pandas: timestamp error

I tried to concat() two Parquet files with pandas in Python.
The concatenation works, but when I try to write and save the DataFrame to a Parquet file, it displays this error:

 ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data:

I checked the pandas documentation: by default it writes timestamps in ms resolution when saving a Parquet file.
How can I write the Parquet file with the original schema after the concat?
Here is my code:

import pandas as pd

table1 = pd.read_parquet(path='path.parquet', engine='pyarrow')
table2 = pd.read_parquet(path='path.parquet', engine='pyarrow')

table = pd.concat([table1, table2], ignore_index=True) 
table.to_parquet('./file.gzip', compression='gzip')
Asked by Neil Su on Dec 22 '18

People also ask

How do I write a pandas DataFrame to a parquet file?

The to_parquet() function is used to write a DataFrame to the binary Parquet format. Its path argument is a file path, or a root directory path that is used when writing a partitioned dataset.
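A minimal illustration (file and column names here are made up):

import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]})

# Write a single Parquet file...
df.to_parquet("data.parquet")

# ...or pass a directory path plus partition_cols to write a
# partitioned dataset rooted at that directory.
df.to_parquet("dataset_root", partition_cols=["value"])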

Can pandas write to parquet?

Yes, pandas can write a DataFrame to the binary Parquet format. An additional library is required to act as the engine, such as pyarrow or fastparquet.
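For example (a sketch; either engine must be installed separately):

import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0]})
df.to_parquet("out.parquet", engine="pyarrow")      # requires pyarrow
df.to_parquet("out.parquet", engine="fastparquet")  # requires fastparquet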

What is a pandas Timestamp?

Timestamp is the pandas equivalent of Python's datetime and is interchangeable with it in most cases. It is the type used for the entries that make up a DatetimeIndex and other time-series-oriented data structures in pandas. There are essentially three calling conventions for the constructor.
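The three calling conventions look roughly like this (values chosen arbitrarily):

import pandas as pd

# 1. From a string, as with datetime.datetime
ts1 = pd.Timestamp("2018-12-22 06:12:00")

# 2. From a numeric epoch value plus a unit
ts2 = pd.Timestamp(1545459120, unit="s")

# 3. From datetime-style keyword components
ts3 = pd.Timestamp(year=2018, month=12, day=22, hour=6, minute=12)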




4 Answers

Pandas has forwarded unknown kwargs to the underlying Parquet engine since at least v0.22. As such, using table.to_parquet(allow_truncated_timestamps=True) should work; I verified it for pandas v0.25.0 and pyarrow 0.13.0. For more keywords, see the pyarrow docs.
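Applied to the question's code, that would look like this (a sketch; the keyword is simply forwarded through to_parquet to pyarrow):

import pandas as pd

table1 = pd.read_parquet('path.parquet', engine='pyarrow')
table2 = pd.read_parquet('path.parquet', engine='pyarrow')

table = pd.concat([table1, table2], ignore_index=True)

# allow_truncated_timestamps is passed through to pyarrow and avoids
# the ArrowInvalid error by permitting nanosecond values to be truncated.
table.to_parquet('./file.gzip', compression='gzip',
                 allow_truncated_timestamps=True)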

Answered by Axel on Oct 21 '22

Thanks to @axel for the link to Apache Arrow documentation:

allow_truncated_timestamps (bool, default False) – Allow loss of data when coercing timestamps to a particular resolution. E.g. if microsecond or nanosecond data is lost when coercing to ‘ms’, do not raise an exception.

It seems like in modern Pandas versions we can pass parameters to ParquetWriter.

The following code worked properly for me (Pandas 1.1.1, PyArrow 1.0.1):

df.to_parquet(filename, use_deprecated_int96_timestamps=True)
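Alternatively (an illustrative sketch, not from the answer itself), the same coercion options can be passed at the pyarrow level, where write_table accepts coerce_timestamps and allow_truncated_timestamps directly:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"ts": pd.to_datetime(["2018-12-22 06:12:00.123456789"])})

# Coerce nanosecond timestamps to millisecond resolution and allow the
# sub-millisecond part to be dropped instead of raising ArrowInvalid.
pq.write_table(pa.Table.from_pandas(df), "file.parquet",
               coerce_timestamps="ms",
               allow_truncated_timestamps=True)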
Answered by MaxU - stop WAR against UA on Oct 21 '22


I think this is a bug and you should do what Wes says. However, if you need working code now, I have a workaround.

The solution that worked for me was to specify the timestamp columns to be millisecond precision. If you need nanosecond precision, this will ruin your data... but if that's the case, it may be the least of your problems.

import pandas as pd

table1 = pd.read_parquet(path='path1.parquet')
table2 = pd.read_parquet(path='path2.parquet')

table1["Date"] = table1["Date"].astype("datetime64[ms]")
table2["Date"] = table2["Date"].astype("datetime64[ms]")

table = pd.concat([table1, table2], ignore_index=True) 
table.to_parquet('./file.gzip', compression='gzip')
Answered by DrDeadKnee on Oct 21 '22


I experienced a similar problem while using pd.to_parquet. My final workaround was to use the argument engine='fastparquet', but I realize this doesn't help if you need to use PyArrow specifically.
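A minimal sketch of that workaround, mirroring the question's code (path names assumed):

import pandas as pd

table1 = pd.read_parquet('path1.parquet')
table2 = pd.read_parquet('path2.parquet')

table = pd.concat([table1, table2], ignore_index=True)

# fastparquet did not raise the timestamp-truncation error here,
# unlike the default pyarrow engine.
table.to_parquet('./file.gzip', engine='fastparquet', compression='gzip')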

Things I tried which did not work:

  • @DrDeadKnee's workaround of manually casting columns .astype("datetime64[ms]") did not work for me (pandas v. 0.24.2)
  • Passing coerce_timestamps='ms' as a kwarg to the underlying parquet operation did not change behaviour.
Answered by Geoff on Oct 21 '22