Reading Parquet files from multiple directories in PySpark

I need to read Parquet files from multiple paths that are not parent or child directories of each other.

For example:

dir1 ---
       |
       ------- dir1_1
       |
       ------- dir1_2
dir2 ---
       |
       ------- dir2_1
       |
       ------- dir2_2

sqlContext.read.parquet(dir1) reads parquet files from dir1_1 and dir1_2

Right now I'm reading each directory and merging the dataframes with unionAll (sketched below). Is there a way to read Parquet files from dir1_2 and dir2_1 without using unionAll, or is there some fancy way to use unionAll?
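Roughly what I'm doing now (a minimal sketch; the paths are the example directories above, and sqlContext is the Spark 1.x entry point):

from functools import reduce
from pyspark.sql import DataFrame

# Read each leaf directory separately, then merge the results with unionAll.
dfs = [sqlContext.read.parquet(p) for p in ['dir1/dir1_2', 'dir2/dir2_1']]
df = reduce(DataFrame.unionAll, dfs)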

Thanks

asked May 16 '16 by joshsuihn


1 Answer

A little late, but I found this while I was searching, and it may help someone else...

You might also try unpacking the argument list to spark.read.parquet():

paths = ['foo', 'bar']
df = spark.read.parquet(*paths)
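Applied to the layout in the question, that would look something like this (the leaf directories are just the example names from the question):

# Each leaf directory is its own argument; no unionAll needed.
df = spark.read.parquet('dir1/dir1_2', 'dir2/dir2_1')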

This is convenient if you want to pass a few glob patterns into the path argument:

basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
df = spark.read.option("basePath", basePath).parquet(*paths)

This is cool because you don't need to list all the files in the basePath, and you still get partition inference.
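As a quick sanity check (assuming the example layout above, where the partition column names come from the key=value directory names):

# With basePath set, Spark infers partition_value1 and partition_value2
# as columns from the directory names.
df.select('partition_value1', 'partition_value2').distinct().show()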

answered Oct 02 '22 by N00b