How do you append/update to a parquet file with pyarrow?
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table2 = pd.DataFrame({'one': [-1, np.nan, 2.5],
                       'two': ['foo', 'bar', 'baz'],
                       'three': [True, False, True]})
table3 = pd.DataFrame({'six': [-1, np.nan, 2.5],
                       'nine': ['foo', 'bar', 'baz'],
                       'ten': [True, False, True]})

# pq.write_table expects a pyarrow Table, so convert the DataFrame first
pq.write_table(pa.Table.from_pandas(table2), './dataNew/pqTest2.parquet')
# append pqTest2 here?
I found nothing in the docs about appending to parquet files. Also, can you use pyarrow with multiprocessing to insert/update the data?
Parquet slices columns into chunks and allows parts of a column to be stored across several chunks within a single file, so appending is possible.
Combined, these limitations mean that they cannot be used to append to an existing .parquet file; they can only be used to write a .parquet file in chunks. The technique above removes these limitations, at the expense of being less efficient, as the entire file has to be rewritten to append to the end.
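For completeness, here is a minimal sketch of that rewrite-the-whole-file approach: read the existing file, concatenate the new rows, and write everything back. The helper name append_by_rewrite is just illustrative, not a pyarrow API.

import pyarrow as pa
import pyarrow.parquet as pq

def append_by_rewrite(filepath, new_table):
    # Read the existing file, concatenate the new rows, and rewrite everything.
    # Inefficient for large files, but it works on an already-closed .parquet file.
    existing = pq.read_table(filepath)
    combined = pa.concat_tables([existing, new_table])
    pq.write_table(combined, filepath)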
Yup, it's quite possible to write a pandas dataframe to the binary parquet format. Additional libraries are required, such as pyarrow or fastparquet.
Pandas to_parquet can handle both single files and directories containing multiple files. Pandas will silently overwrite the file if it already exists. To append to a parquet dataset, just add a new file to the same parquet directory.
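As a sketch of that directory-based approach (the directory and file names here are made up), each append simply writes a new file with a unique name into the dataset directory. Since every writer produces its own file, this pattern also lends itself to multiple processes writing concurrently:

import os
import uuid
import pandas as pd

dataset_dir = './dataNew/pqDataset'  # a directory, not a single file
os.makedirs(dataset_dir, exist_ok=True)

df = pd.DataFrame({'one': [-1, 2.5], 'two': ['foo', 'baz'], 'three': [True, False]})
# Each "append" is just a new file with a unique name in the same directory
df.to_parquet(os.path.join(dataset_dir, 'part-{}.parquet'.format(uuid.uuid4().hex)))

# Reading the directory returns the union of all files
full = pd.read_parquet(dataset_dir)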
I ran into the same issue and I think I was able to solve it using the following:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunksize = 10000  # this is the number of lines
pqwriter = None
for i, df in enumerate(pd.read_csv('sample.csv', chunksize=chunksize)):
    table = pa.Table.from_pandas(df)
    # for the first chunk of records
    if i == 0:
        # create a parquet write object giving it an output file
        pqwriter = pq.ParquetWriter('sample.parquet', table.schema)
    pqwriter.write_table(table)

# close the parquet writer
if pqwriter:
    pqwriter.close()
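A small variation on the same idea: in recent pyarrow versions, ParquetWriter can be used as a context manager, which closes the file for you. A sketch, assuming sample.csv yields the same schema for every chunk:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunksize = 10000
chunks = pd.read_csv('sample.csv', chunksize=chunksize)
first = next(chunks)  # peek at the first chunk to derive the schema
table = pa.Table.from_pandas(first)

with pq.ParquetWriter('sample.parquet', table.schema) as pqwriter:
    pqwriter.write_table(table)
    for df in chunks:
        pqwriter.write_table(pa.Table.from_pandas(df))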
In your case the column names are not consistent. I made the column names consistent across the three sample dataframes, and the following code worked for me.
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def append_to_parquet_table(dataframe, filepath=None, writer=None):
    """Write/append a dataframe in parquet format.

    This method writes a pandas DataFrame as a pyarrow Table in parquet format.
    If the method is invoked with a writer, it appends the dataframe to the
    already written pyarrow table.

    :param dataframe: pd.DataFrame to be written in parquet format.
    :param filepath: target file location for the parquet file.
    :param writer: ParquetWriter object to write pyarrow tables in parquet format.
    :return: ParquetWriter object. This can be passed to subsequent method calls
        to append a DataFrame to the pyarrow Table.
    """
    table = pa.Table.from_pandas(dataframe)
    if writer is None:
        writer = pq.ParquetWriter(filepath, table.schema)
    writer.write_table(table=table)
    return writer


if __name__ == '__main__':
    table1 = pd.DataFrame({'one': [-1, np.nan, 2.5],
                           'two': ['foo', 'bar', 'baz'],
                           'three': [True, False, True]})
    table2 = pd.DataFrame({'one': [-1, np.nan, 2.5],
                           'two': ['foo', 'bar', 'baz'],
                           'three': [True, False, True]})
    table3 = pd.DataFrame({'one': [-1, np.nan, 2.5],
                           'two': ['foo', 'bar', 'baz'],
                           'three': [True, False, True]})
    writer = None
    filepath = '/tmp/verify_pyarrow_append.parquet'
    table_list = [table1, table2, table3]

    for table in table_list:
        writer = append_to_parquet_table(table, filepath, writer)

    if writer:
        writer.close()

    df = pd.read_parquet(filepath)
    print(df)
Output:
   one  three  two
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
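Note that the pandas index repeats (0, 1, 2 three times) because each appended DataFrame keeps its own index. If you don't want that, pass preserve_index=False to pa.Table.from_pandas before writing.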