How to read multiple .parquet files from multiple directories into single pandas dataframe?

Tags: pandas, parquet

I need to read parquet files from multiple directories.

For example:

 Dir/
 |-- dir1/
 |   |-- .parquet
 |   |-- .parquet
 |-- dir2/
     |-- .parquet
     |-- .parquet
     |-- .parquet

Is there a way to read these files into a single pandas DataFrame?

Note: all of the parquet files were generated using PySpark.

Ahmad Senousi asked Oct 25 '25 03:10


1 Answer

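Collect the files with glob, call read_parquet on each one in a list comprehension, and combine the results with concat. The ** pattern matches nested directories only when recursive=True is passed (Python 3.5+); without it, ** behaves like a single *:

import glob

import pandas as pd

# '**' with recursive=True matches Dir itself and every subdirectory below it
files = glob.glob('Dir/**/*.parquet', recursive=True)

# read each file into its own DataFrame, then stack them row-wise
df = pd.concat([pd.read_parquet(fp) for fp in files])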
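
A slightly more defensive variant of the same idea could look like the sketch below. The helper names find_parquet_files and load_parquet_tree are made up for illustration and are not part of pandas. It assumes every file shares a compatible schema, sorts the paths so the row order is reproducible, resets the index, and fails with a clear message when nothing matches:

```python
from pathlib import Path

import pandas as pd


def find_parquet_files(root):
    """Return every *.parquet file under root, at any depth, in sorted order.

    Hypothetical helper, not a pandas function.
    """
    # rglob recurses into subdirectories; Spark side files such as _SUCCESS
    # or *.crc do not end in .parquet, so the pattern skips them.
    return sorted(p for p in Path(root).rglob("*.parquet") if p.is_file())


def load_parquet_tree(root):
    """Read all parquet files under root into one DataFrame (hypothetical helper)."""
    files = find_parquet_files(root)
    if not files:
        # pd.concat raises a less descriptive ValueError on an empty list.
        raise FileNotFoundError(f"no .parquet files found under {root!r}")
    # ignore_index=True builds a fresh 0..n-1 index instead of repeating
    # each file's own index in the combined frame.
    return pd.concat([pd.read_parquet(fp) for fp in files], ignore_index=True)
```

With the layout from the question this would be called as df = load_parquet_tree('Dir'). Since the files come from PySpark, dir1 and dir2 are probably Spark output directories; pd.read_parquet also accepts a directory path, so reading each directory in one call may be an alternative worth trying.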
jezrael answered Oct 26 '25 17:10