How do you append/update to a parquet file with pyarrow?
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table2 = pd.DataFrame({'one': [-1, np.nan, 2.5],
                       'two': ['foo', 'bar', 'baz'],
                       'three': [True, False, True]})
table3 = pd.DataFrame({'six': [-1, np.nan, 2.5],
                       'nine': ['foo', 'bar', 'baz'],
                       'ten': [True, False, True]})

# pq.write_table expects a pyarrow Table, so convert the DataFrame first
pq.write_table(pa.Table.from_pandas(table2), './dataNew/pqTest2.parquet')
# append pqTest2 here?
I found nothing in the docs about appending to parquet files. Also, can you use pyarrow with multiprocessing to insert/update the data?
Parquet slices columns into chunks and allows parts of a column to be stored across several chunks within a single file, so appending is possible.
Combined, these limitations mean that they cannot be used to append to an existing .parquet file; they can only be used to write a .parquet file in chunks. The technique above removes these limitations, at the expense of being less efficient, as the entire file has to be rewritten to append to the end.
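For completeness, here is a minimal sketch of that rewrite-the-whole-file approach: read the existing file, concatenate the new rows, and write everything back. The helper name append_by_rewrite is just illustrative, not a pyarrow API.

import pyarrow as pa
import pyarrow.parquet as pq

def append_by_rewrite(filepath, new_table):
    # Read the existing file, concatenate the new rows, and rewrite everything.
    # Inefficient for large files, but it works on an already-closed .parquet file.
    existing = pq.read_table(filepath)
    combined = pa.concat_tables([existing, new_table])
    pq.write_table(combined, filepath)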
Yup, it's quite possible to write a pandas dataframe to the binary parquet format. Additional libraries are required, such as pyarrow or fastparquet.
Pandas to_parquet can handle both single files and directories containing multiple files. Pandas will silently overwrite the file if it already exists. To append to a parquet dataset, just add a new file to the same parquet directory.
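As a sketch of that directory-based approach (the directory and file names here are made up), each append simply writes a new file with a unique name into the dataset directory. Since every writer produces its own file, this pattern also lends itself to multiple processes writing concurrently:

import os
import uuid
import pandas as pd

dataset_dir = './dataNew/pqDataset'  # a directory, not a single file
os.makedirs(dataset_dir, exist_ok=True)

df = pd.DataFrame({'one': [-1, 2.5], 'two': ['foo', 'baz'], 'three': [True, False]})
# Each "append" is just a new file with a unique name in the same directory
df.to_parquet(os.path.join(dataset_dir, 'part-{}.parquet'.format(uuid.uuid4().hex)))

# Reading the directory returns the union of all files
full = pd.read_parquet(dataset_dir)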
I ran into the same issue and I think I was able to solve it using the following:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunksize = 10000  # this is the number of lines
pqwriter = None
for i, df in enumerate(pd.read_csv('sample.csv', chunksize=chunksize)):
    table = pa.Table.from_pandas(df)
    # for the first chunk of records
    if i == 0:
        # create a parquet write object giving it an output file
        pqwriter = pq.ParquetWriter('sample.parquet', table.schema)
    pqwriter.write_table(table)

# close the parquet writer
if pqwriter:
    pqwriter.close()
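A small variation on the same idea: in recent pyarrow versions, ParquetWriter can be used as a context manager, which closes the file for you. A sketch, assuming sample.csv yields the same schema for every chunk:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunksize = 10000
chunks = pd.read_csv('sample.csv', chunksize=chunksize)
first = next(chunks)  # peek at the first chunk to derive the schema
table = pa.Table.from_pandas(first)

with pq.ParquetWriter('sample.parquet', table.schema) as pqwriter:
    pqwriter.write_table(table)
    for df in chunks:
        pqwriter.write_table(pa.Table.from_pandas(df))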
In your case the column names are not consistent. I made the column names consistent across the three sample dataframes, and the following code worked for me.
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def append_to_parquet_table(dataframe, filepath=None, writer=None):
    """Write/append a dataframe in parquet format.

    This method writes a pandas DataFrame as a pyarrow Table in parquet format.
    If the method is invoked with a writer, it appends the dataframe to the
    already written pyarrow table.

    :param dataframe: pd.DataFrame to be written in parquet format.
    :param filepath: target file location for the parquet file.
    :param writer: ParquetWriter object to write pyarrow tables in parquet format.
    :return: ParquetWriter object. This can be passed to subsequent method calls
        to append a DataFrame to the pyarrow Table.
    """
    table = pa.Table.from_pandas(dataframe)
    if writer is None:
        writer = pq.ParquetWriter(filepath, table.schema)
    writer.write_table(table=table)
    return writer


if __name__ == '__main__':
    table1 = pd.DataFrame({'one': [-1, np.nan, 2.5],
                           'two': ['foo', 'bar', 'baz'],
                           'three': [True, False, True]})
    table2 = pd.DataFrame({'one': [-1, np.nan, 2.5],
                           'two': ['foo', 'bar', 'baz'],
                           'three': [True, False, True]})
    table3 = pd.DataFrame({'one': [-1, np.nan, 2.5],
                           'two': ['foo', 'bar', 'baz'],
                           'three': [True, False, True]})
    writer = None
    filepath = '/tmp/verify_pyarrow_append.parquet'
    table_list = [table1, table2, table3]

    for table in table_list:
        writer = append_to_parquet_table(table, filepath, writer)

    if writer:
        writer.close()

    df = pd.read_parquet(filepath)
    print(df)
Output:
   one  three  two
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz
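Note that the pandas index repeats (0, 1, 2 three times) because each appended DataFrame keeps its own index. If you don't want that, pass preserve_index=False to pa.Table.from_pandas before writing.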