Pyarrow apply schema when using pandas to_parquet()

I have a very wide DataFrame (20,000 columns) that is made up mostly of float64 columns in pandas. I want to cast these columns to float32 and write them to Parquet format. I am doing this because the downstream users of these files are small containers with limited memory.

I currently cast within pandas and then write out to Parquet, but the cast is very slow on a wide data set. Is it possible to cast the types during the to_parquet write itself? A dummy example is shown below.

import pandas as pd
import numpy as np
import pyarrow
df = pd.DataFrame(np.random.randn(3000, 15000)) # make dummy data set
df.columns = [str(x) for x in list(df)] # make column names string for parquet
df[list(df.loc[:, df.dtypes == float])] = df[list(df.loc[:, df.dtypes == float])].astype('float32') # cast the data
df.to_parquet("myfile.parquet") # write out the df
asked Oct 17 '18 by warwickh


2 Answers

Using pandas 1.0.x and pyarrow 0.15+, it is possible to pass a schema parameter to to_parquet, as presented below, using a schema definition taken from this post. See the types available in pyarrow for use in a schema definition.

import pandas as pd
import pyarrow as pa

FILE_PATH = "/tmp/df.parquet"
df = pd.DataFrame({'a': [None, None]})
df.to_parquet(FILE_PATH)
pd.read_parquet(FILE_PATH).dtypes

This gives the following type:

a    object
dtype: object

With the schema defined:

SCHEMA = pa.schema([('a', pa.int32())])
df.to_parquet(FILE_PATH, schema=SCHEMA)

pd.read_parquet(FILE_PATH).dtypes

It now gives the following type (the column is written as int32 in the Parquet file, but because it contains only nulls, pandas promotes it back to float64 on read):

a    float64
dtype: object
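
Applied to the question's wide frame, a minimal sketch of the same idea (my adaptation, not from the original answer; it assumes every column should become float32, and the lossy float64-to-float32 cast may require a sufficiently recent pyarrow):

import numpy as np
import pandas as pd
import pyarrow as pa

# Build a schema mapping every column to float32 and let
# to_parquet perform the cast during the write itself.
df = pd.DataFrame(np.random.randn(3000, 15000))
df.columns = [str(x) for x in df.columns]

SCHEMA = pa.schema([(name, pa.float32()) for name in df.columns])
df.to_parquet("myfile.parquet", schema=SCHEMA)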
answered Oct 14 '22 by Krzysztof Słowiński


Try using arrow instead of pandas to do the downcasting:

import pyarrow as pa

def convert_arrow(df):
    # Convert the whole frame to an Arrow table once, then cast only
    # the float64 columns down to float32.
    table = pa.Table.from_pandas(df)
    columns = [
        c.cast(pa.float32()) if c.type == pa.float64() else c
        for c in table.itercolumns()
    ]
    return pa.Table.from_arrays(columns, table.column_names)
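
To write the converted table out, a minimal sketch using pyarrow's own Parquet writer (df here is the dummy frame from the question):

import pyarrow.parquet as pq

table = convert_arrow(df)  # df: the dummy frame from the question
pq.write_table(table, "myfile.parquet")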

I did a simple benchmark and it is 20 times faster.

I think the problem with your code is that it assigns columns one by one in an existing DataFrame, which is not efficient. This blog post explains it well: https://uwekorn.com/2020/05/24/the-one-pandas-internal.html

Another simple solution not involving arrow is to convert each column and create the DataFrame at the end. The code below is slightly slower than the arrow version:

def convert_pandas_by_columns(df):
    # Cast each float column to float32, then rebuild the frame in one
    # go instead of assigning columns into an existing DataFrame.
    columns = [
        df[c].astype('float32') if df[c].dtype == float else df[c]
        for c in df.columns
    ]
    # pd.DataFrame(list_of_series) would stack the series as rows;
    # concat along axis=1 keeps them as columns.
    return pd.concat(columns, axis=1)
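
For completeness, a hypothetical micro-benchmark comparing the two converters on the question's dummy frame (absolute timings will vary by machine):

import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3000, 15000))
df.columns = [str(x) for x in df.columns]

start = time.perf_counter()
convert_arrow(df)
print(f"arrow:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
convert_pandas_by_columns(df)
print(f"pandas: {time.perf_counter() - start:.2f}s")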
answered Oct 14 '22 by 0x26res