 

Read multiple parquet files in a folder and write to a single CSV file using Python

I am new to Python and I have a scenario where there are multiple parquet files with file names in order, e.g. par_file1, par_file2, par_file3 and so on, up to 100 files in a folder.

I need to read these parquet files starting from file1, in order, and write them to a single CSV file. After writing the contents of file1, the contents of file2 should be appended to the same CSV without the header. Note that all files have the same column names and only the data is split across multiple files.

I learnt to convert a single parquet file to a CSV file using pyarrow with the following code:

import pandas as pd

# read one parquet file into a DataFrame, then write it back out as CSV
df = pd.read_parquet('par_file.parquet')
df.to_csv('csv_file.csv')

But I couldn't extend this to loop over multiple parquet files and append them to a single CSV. Is there a method in pandas to do this? Any other way to do this would be of great help. Thank you.

asked Aug 05 '18 by Pri31


1 Answer

I ran into this question looking to see if pandas can natively read partitioned parquet datasets. I have to say that the current answer is unnecessarily verbose (making it difficult to parse). I also imagine that it's not particularly efficient to constantly open and close file handles and then scan to the end of them, depending on the file sizes.

A better alternative would be to read all the parquet files into a single DataFrame, and write it once:

from pathlib import Path
import pandas as pd

data_dir = Path('dir/to/parquet/files')
# sorted() gives a deterministic file order; glob() alone guarantees none
full_df = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in sorted(data_dir.glob('*.parquet'))
)
full_df.to_csv('csv_file.csv')
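
One caveat: glob() makes no ordering promises, and even sorted() orders lexicographically, so par_file10 would land before par_file2. If the numeric order in the question matters, a natural-sort key is a small addition. A sketch; the numeric_key helper is my own, not part of the original answer:

import re
from pathlib import Path

def numeric_key(path):
    # pull the trailing number out of names like 'par_file12.parquet' -> 12
    match = re.search(r'(\d+)$', path.stem)
    return int(match.group(1)) if match else -1

data_dir = Path('dir/to/parquet/files')
parquet_files = sorted(data_dir.glob('*.parquet'), key=numeric_key)

The resulting parquet_files list can then feed the pd.concat call above in place of the plain glob.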

Alternatively, if you really want to just append to the file:

data_dir = Path('dir/to/parquet/files')
for i, parquet_path in enumerate(sorted(data_dir.glob('*.parquet'))):
    df = pd.read_parquet(parquet_path)
    write_header = i == 0 # write header only on the 0th file
    write_mode = 'w' if i == 0 else 'a' # 'write' mode for 0th file, 'append' otherwise
    df.to_csv('csv_file.csv', mode=write_mode, header=write_header)
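
If the individual parquet files are too large to read whole, pyarrow can stream each one in record batches and append chunk by chunk. A sketch along the same lines, untested; the batch_size value and the index=False choice are mine:

from pathlib import Path
import pyarrow.parquet as pq

data_dir = Path('dir/to/parquet/files')
with open('csv_file.csv', 'w', newline='') as csv_handle:
    for i, parquet_path in enumerate(sorted(data_dir.glob('*.parquet'))):
        parquet_file = pq.ParquetFile(parquet_path)
        # stream each file in chunks instead of loading it all at once
        for j, batch in enumerate(parquet_file.iter_batches(batch_size=65536)):
            batch.to_pandas().to_csv(
                csv_handle,
                header=(i == 0 and j == 0),  # header only at the very top
                index=False,  # per-batch indices would restart at 0 anyway
            )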

A final alternative for appending each file is to open the target CSV file in "a+" mode at the outset, which keeps the file handle scanned to the end of the file for each write/append (I believe this works, but haven't actually tested it):

data_dir = Path('dir/to/parquet/files')
# newline='' prevents pandas' CSV writer from adding blank lines on Windows
with open('csv_file.csv', "a+", newline='') as csv_handle:
    for i, parquet_path in enumerate(sorted(data_dir.glob('*.parquet'))):
        df = pd.read_parquet(parquet_path)
        write_header = i == 0  # write header only on the 0th file
        df.to_csv(csv_handle, header=write_header)
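
One caveat with "a+": if csv_file.csv already exists, the new rows land after its current contents, so delete or truncate the file first if you want a fresh export.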
answered Oct 14 '22 by PMende