
Using pyarrow, how do you append to a parquet file?

How do you append/update to a parquet file with pyarrow?

import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})

table3 = pd.DataFrame({'six': [-1, np.nan, 2.5], 'nine': ['foo', 'bar', 'baz'], 'ten': [True, False, True]})

pq.write_table(pa.Table.from_pandas(table2), './dataNew/pqTest2.parquet')
# append to pqTest2 here?

I found nothing in the docs about appending to parquet files. Also, can you use pyarrow with multiprocessing to insert/update the data?

asked Nov 04 '17 by Merlin


People also ask

Is it possible to append to a parquet file?

Parquet slices columns into chunks and allows parts of a column to be stored in several chunks within a single file, thus append is possible.
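
Those chunks are parquet row groups. As a rough illustration (a minimal sketch; the file name is hypothetical and assumes a file written in several chunks, e.g. via repeated ParquetWriter.write_table calls), you can inspect them with pyarrow:

import pyarrow.parquet as pq

# 'sample.parquet' is a hypothetical file written in several chunks
pf = pq.ParquetFile('sample.parquet')
print(pf.num_row_groups)       # one row group per appended chunk
print(pf.read_row_group(0))    # read back only the first chunk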

Can you append to a parquet file Python?

Combined, these limitations mean that they cannot be used to append to an existing .parquet file; they can only be used to write a .parquet file in chunks. The technique above removes these limitations, at the expense of being less efficient, as the entire file has to be rewritten to append to the end.
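
In other words, the fallback is to read the whole file, concatenate the new rows, and write everything back out. A minimal sketch of that rewrite approach (file name and columns are just illustrative):

import pandas as pd

existing = pd.read_parquet('data.parquet')            # read the whole existing file
new_rows = pd.DataFrame({'one': [3.5], 'two': ['qux'], 'three': [False]})
combined = pd.concat([existing, new_rows], ignore_index=True)
combined.to_parquet('data.parquet', index=False)      # rewrite the entire file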

Can pandas write to parquet?

Yup, it is quite possible to write a pandas DataFrame to the binary parquet format. Additional libraries such as pyarrow or fastparquet are required.
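
A minimal sketch (the file name is illustrative; either engine works if it is installed):

import pandas as pd

df = pd.DataFrame({'one': [-1, 2.5], 'two': ['foo', 'bar']})
df.to_parquet('example.parquet', engine='pyarrow')    # or engine='fastparquet'
back = pd.read_parquet('example.parquet')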

How do I add parquet files to pandas?

Pandas to_parquet can handle both single files and directories with multiple files in them. Pandas will silently overwrite the file if the file is already there. To append to a parquet object, just add a new file to the same parquet directory.
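
A minimal sketch of that directory-based approach (directory and file names are hypothetical):

import os
import pandas as pd

os.makedirs('dataset_dir', exist_ok=True)

df1 = pd.DataFrame({'one': [1, 2], 'two': ['a', 'b']})
df2 = pd.DataFrame({'one': [3, 4], 'two': ['c', 'd']})
df1.to_parquet('dataset_dir/part-0.parquet')
# "append" by adding another file to the same directory
df2.to_parquet('dataset_dir/part-1.parquet')

combined = pd.read_parquet('dataset_dir')   # reads every file in the directory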


2 Answers

I ran into the same issue and I think I was able to solve it using the following:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunksize = 10000  # this is the number of lines per chunk

pqwriter = None
for i, df in enumerate(pd.read_csv('sample.csv', chunksize=chunksize)):
    table = pa.Table.from_pandas(df)
    # for the first chunk of records
    if i == 0:
        # create a parquet write object giving it an output file
        pqwriter = pq.ParquetWriter('sample.parquet', table.schema)
    pqwriter.write_table(table)

# close the parquet writer
if pqwriter:
    pqwriter.close()
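
As a quick sanity check (a small sketch, assuming the file produced above), you can read the result back and confirm all chunks ended up in one file:

import pandas as pd

df_out = pd.read_parquet('sample.parquet')
print(len(df_out))   # total rows written across all chunks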
answered Sep 30 '22 by Ibraheem Ibraheem


In your case the column names are not consistent across the dataframes. I made the column names consistent for the three sample dataframes, and the following code worked for me.

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def append_to_parquet_table(dataframe, filepath=None, writer=None):
    """Write or append a DataFrame to a parquet file.

    This method writes a pandas DataFrame as a pyarrow Table in parquet format. If the method is invoked
    with a writer, it appends the dataframe to the already written pyarrow table.

    :param dataframe: pd.DataFrame to be written in parquet format.
    :param filepath: target file location for the parquet file.
    :param writer: ParquetWriter object to write pyarrow tables in parquet format.
    :return: ParquetWriter object. This can be passed in subsequent method calls to append DataFrames
        to the pyarrow Table.
    """
    table = pa.Table.from_pandas(dataframe)
    if writer is None:
        writer = pq.ParquetWriter(filepath, table.schema)
    writer.write_table(table=table)
    return writer


if __name__ == '__main__':
    table1 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table3 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    writer = None
    filepath = '/tmp/verify_pyarrow_append.parquet'
    table_list = [table1, table2, table3]

    for table in table_list:
        writer = append_to_parquet_table(table, filepath, writer)

    if writer:
        writer.close()

    df = pd.read_parquet(filepath)
    print(df)

Output:

   one  three  two
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
answered Sep 30 '22 by yardstick17