Reading Parquet files from multiple directories in PySpark

I need to read Parquet files from multiple paths that are not parent or child directories of each other.

For example:

dir1 ---
       |
       ------- dir1_1
       |
       ------- dir1_2
dir2 ---
       |
       ------- dir2_1
       |
       ------- dir2_2

sqlContext.read.parquet(dir1) reads parquet files from dir1_1 and dir1_2

Right now I'm reading each directory and merging the dataframes with unionAll (sketched below). Is there a way to read Parquet files from dir1_2 and dir2_1 without using unionAll, or is there some fancy way to use unionAll?
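Roughly what I'm doing now (a minimal sketch; the paths are the example directories above, and sqlContext is the Spark 1.x entry point):

from functools import reduce
from pyspark.sql import DataFrame

# Read each leaf directory separately, then merge the results with unionAll.
dfs = [sqlContext.read.parquet(p) for p in ['dir1/dir1_2', 'dir2/dir2_1']]
df = reduce(DataFrame.unionAll, dfs)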

Thanks

asked May 16 '16 by joshsuihn


1 Answer

A little late, but I found this while I was searching, and it may help someone else...

You might also try unpacking the argument list to spark.read.parquet():

paths = ['foo', 'bar']
df = spark.read.parquet(*paths)
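Applied to the layout in the question, that would look something like this (the leaf directories are just the example names from the question):

# Each leaf directory is its own argument; no unionAll needed.
df = spark.read.parquet('dir1/dir1_2', 'dir2/dir2_1')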

This is convenient if you want to pass a few glob patterns into the path argument:

basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
df = spark.read.option("basePath", basePath).parquet(*paths)

This is cool because you don't need to list all the files in the basePath, and you still get partition inference.
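As a quick sanity check (assuming the example layout above, where the partition column names come from the key=value directory names):

# With basePath set, Spark infers partition_value1 and partition_value2
# as columns from the directory names.
df.select('partition_value1', 'partition_value2').distinct().show()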

answered Oct 02 '22 by N00b