
Read multiple parquet files at the same time in Spark

I can read several JSON files at the same time using * (star):

sqlContext.jsonFile('/path/to/dir/*.json')

Is there any way to do the same thing for parquet? The star doesn't work.

SkyFox asked May 24 '15 07:05


2 Answers

InputPath = [hdfs_path + "parquets/date=18-07-23/hour=2*/*.parquet",
             hdfs_path + "parquets/date=18-07-24/hour=0*/*.parquet"]

df = spark.read.parquet(*InputPath)
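The `*` in `spark.read.parquet(*InputPath)` is Python argument unpacking: it expands the list so each path is passed as a separate positional argument, which works because `DataFrameReader.parquet` accepts a variable number of paths. A minimal pure-Python sketch of that mechanism (no Spark required; `read_parquet` below is a hypothetical stand-in, not the real Spark API):

```python
def read_parquet(*paths):
    # Stand-in for spark.read.parquet: it simply echoes the
    # positional path arguments it received.
    return list(paths)

input_path = [
    "parquets/date=18-07-23/hour=2*/*.parquet",
    "parquets/date=18-07-24/hour=0*/*.parquet",
]

# The * unpacks the list into two separate arguments.
print(read_parquet(*input_path))
```

Without the `*`, the whole list would arrive as a single argument, which is not what the varargs signature expects.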
user6602391 answered Sep 29 '22 16:09


FYI, you can also:

  • read a subset of parquet files using the wildcard symbol *:

    sqlContext.read.parquet("/path/to/dir/part_*.gz")

  • read multiple parquet files by explicitly specifying them:

    sqlContext.read.parquet("/path/to/dir/part_1.gz", "/path/to/dir/part_2.gz")
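Spark resolves such paths with Hadoop-style glob patterns, so `part_*.gz` selects only files whose names match the pattern. As a rough local illustration of that matching behavior (using Python's standard `fnmatch`, which is an approximation, not Spark's actual resolver):

```python
from fnmatch import fnmatch

# Hypothetical directory listing to match against.
files = ["part_1.gz", "part_2.gz", "summary.txt"]

# Keep only the names matching the wildcard pattern, as the
# path "/path/to/dir/part_*.gz" would in Spark.
matched = [f for f in files if fnmatch(f, "part_*.gz")]
print(matched)  # summary.txt is excluded
```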

Boris answered Sep 29 '22 14:09