 

Pandas Dataframe Parquet Data Types?

I am trying to use Pandas and PyArrow to write data to Parquet. I have hundreds of Parquet files that don't need to share the same schema, but when a column appears in more than one file it must have the same data type in each.

I'm getting into situations where the resulting Parquet data types are not what I want them to be. For example, I may write an int64 to a column and the resulting Parquet file stores it as a double. This is causing a lot of trouble on the processing side, where 99% of the data is typed correctly but in 1% of cases it's just the wrong type.

I've tried importing NumPy and wrapping the values like this:

import numpy as np
import pandas

pandas.DataFrame({
  'a': [ np.int64(5100), np.int64(5200), np.int64(5300) ]
})

But I'm still getting the occasional double, so this must be the wrong way to do it. How can I ensure that matching columns have consistent data types across Parquet files?

Update:

I found that this only happens when the column contains one or more None values.

data_frame = pandas.DataFrame({
  'a': [ None, np.int64(5200), np.int64(5200) ]
})

Can Parquet not handle columns that mix None and int64 values?

asked Sep 10 '18 by micah


People also ask

Do Parquet files have data types?

Parquet files use a small number of primitive (or physical) data types. The logical types extend the physical types by specifying how they should be interpreted. Parquet data types not covered here are not supported for reading from or writing to Parquet files (JSON, BSON, binary, and so on).
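As a quick way to see this in practice, here is a minimal sketch, assuming PyArrow is installed and data.parquet is a hypothetical file path, that prints both the Arrow-level and Parquet-level view of a file's schema:

import pyarrow.parquet as pq

# Arrow-level schema: logical types such as int64, double, string
print(pq.read_schema('data.parquet'))

# Parquet-level schema: physical types such as INT64, DOUBLE, BYTE_ARRAY
print(pq.ParquetFile('data.parquet').schema)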

Can pandas DataFrame have different data types?

Pandas uses different names for data types than Python does, for example object for textual data. A column in a DataFrame can only have one data type, and the data type of a single column can be checked with dtype. Make conscious decisions about how to manage missing data.
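A small sketch of checking the dtype per column (the column names and values are just for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
print(df['a'].dtype)  # int64
print(df['b'].dtype)  # object
print(df.dtypes)      # one dtype per column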

What datatype does pandas DataFrame support?

The main types stored in pandas objects are float, int, bool, datetime64[ns], timedelta[ns], and object. In addition, these dtypes have item sizes, e.g. int64 and int32. By default, integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit).
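For example (a small sketch, not specific to Parquet), integers default to int64 and a smaller item size has to be requested explicitly:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
print(df['a'].dtype)                   # int64 by default
print(df['a'].astype(np.int32).dtype)  # int32 when requested explicitly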

Can pandas read Parquet?

Because Parquet is an open-source format, there are many different libraries and engines that can be used to read and write the data. Pandas allows you to customize the engine used to read the data from the file if you know which library is best.
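A minimal sketch of choosing the engine, assuming PyArrow is installed and data.parquet is a hypothetical path:

import pandas as pd

# engine can be 'pyarrow' or 'fastparquet'; pandas picks one automatically if not specified
df = pd.read_parquet('data.parquet', engine='pyarrow')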


1 Answer

Pandas itself cannot handle null/NA values in integer columns at the moment (version 0.23.x). The next release will add a nullable integer type. In the meantime, once you have a null value in an integer column, Pandas automatically converts it into a float column, so you also get a float column in the resulting Parquet file:

import numpy as np
import pandas as pd

df = pd.DataFrame({
  'a': [np.int64(5100), np.int64(5200), np.int64(5300)]
})
# df['a'].dtype == dtype('int64')
df = pd.DataFrame({
  'a': [None, np.int64(5200), np.int64(5200)]
})
# df['a'].dtype == dtype('float64')
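Once the nullable integer type mentioned above is available (pandas 0.24 or newer), keeping integers alongside missing values could look like this sketch; the capitalized 'Int64' extension dtype is the nullable counterpart of int64:

import pandas as pd

# The 'Int64' dtype stores integers plus a missing-value marker,
# so a None no longer forces the column to float64.
s = pd.Series([None, 5200, 5200], dtype='Int64')
# s.dtype == Int64Dtype()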
answered Oct 08 '22 by Uwe L. Korn