I need to read parquet files from multiple paths that are not parent or child directories.
For example:
dir1
├── dir1_1
└── dir1_2
dir2
├── dir2_1
└── dir2_2
sqlContext.read.parquet(dir1)
reads parquet files from dir1_1 and dir1_2.
Right now I'm reading each directory separately and merging the dataframes with unionAll.
Is there a way to read parquet files from dir1_2 and dir2_1 without using unionAll, or is there a fancier way to do it with unionAll?
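For reference, this is roughly what I'm doing today (paths taken from the example layout above):

# read each directory separately, then merge the results
df1 = sqlContext.read.parquet("dir1/dir1_2")
df2 = sqlContext.read.parquet("dir2/dir2_1")
df = df1.unionAll(df2)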
Thanks
Read directories and files using Spark: ds = spark.read().json("/path/to/dir"); we can also specify multiple paths, each as its own argument.
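The same works for parquet; a minimal sketch using the directories from the question (the path strings are illustrative):

# pass each path as its own argument; no unionAll needed
df = spark.read.parquet("dir1/dir1_2", "dir2/dir2_1")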
A little late but I found this while I was searching and it may help someone else...
You might also try unpacking the argument list to spark.read.parquet()
paths = ['foo', 'bar']
df = spark.read.parquet(*paths)
This is convenient if you want to pass a few globs into the path argument:
basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
df = spark.read.option("basePath", basePath).parquet(*paths)
This is handy because you don't need to list every file under the basePath, and you still get partition inference.
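To illustrate, with the layout above the inferred partition keys become ordinary columns on the dataframe (the column names come from the glob patterns, so this assumes that layout):

# partition columns are inferred relative to basePath
df.select("partition_value1", "partition_value2").distinct().show()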