I'm trying to load a directory of parquet files in Spark but can't get it to work. This works:
val df = sqlContext.load("hdfs://nameservice1/data/rtl/events/stream/loaddate=20151102")
but this doesn't work:
val df = sqlContext.load("hdfs://nameservice1/data/rtl/events/stream/loaddate=201511*")
It gives me this error:
java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/data/rtl/events/stream/loaddate=201511*
How do I get it to work with a wildcard?
You can list the files or folders with the Hadoop FileSystem's listStatus, read each path you want into a DataFrame, and then merge them all into a single DataFrame with a reduce over unionAll.
Get the files/folders:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List everything under the parent directory
val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
Read in the data:
val parquetFiles = status.map(folder => {
  sqlContext.read.parquet(folder.getPath.toString)
})
Merge the data into a single DataFrame:
val mergedFile = parquetFiles.reduce((x, y) => x.unionAll(y))
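Putting it all together, here is a minimal end-to-end sketch, assuming Spark 1.x with a sqlContext in scope; the parent path and the loaddate filter are illustrative, taken from the question's layout rather than anything I know about your cluster:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative parent directory holding the loaddate=... partition folders
val parent = "hdfs://nameservice1/data/rtl/events/stream"

val fs = FileSystem.get(new Configuration())

// Keep only the partition folders matching the range you want
val folders = fs.listStatus(new Path(parent))
  .filter(_.isDirectory)
  .map(_.getPath.toString)
  .filter(_.contains("loaddate=201511"))

// Read each folder and union everything into one DataFrame
val merged = folders
  .map(p => sqlContext.read.parquet(p))
  .reduce((x, y) => x.unionAll(y))

merged.count()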
You can also have a look at my past posts on the same topic:
Spark Scala list folders in directory
Spark/Scala flatten and flatMap is not working on DataFrame
If the provided paths are partition directories, set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, load them separately and then union them.
For example:
val basePath = "hdfs://nameservice1/data/rtl/events/stream"
sparkSession.read.option("basePath", basePath).parquet(basePath + "/loaddate=201511*")
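With basePath set, Spark discovers loaddate as a partition column, so instead of a glob you can also read the table root and filter on that column, letting Spark prune partitions for you. A minimal sketch, assuming Spark 2.x with a SparkSession named sparkSession; the cast guards against partition type inference reading loaddate as an integer:

val base = "hdfs://nameservice1/data/rtl/events/stream"
val df = sparkSession.read.option("basePath", base).parquet(base)
// loaddate comes back as a partition column; filtering on it prunes partitions
val nov2015 = df.filter(df("loaddate").cast("string").startsWith("201511"))
nov2015.count()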