I'm trying to load a directory of parquet files in Spark but can't get it to work. This works:
val df = sqlContext.load("hdfs://nameservice1/data/rtl/events/stream/loaddate=20151102")
but this doesn't work:
val df = sqlContext.load("hdfs://nameservice1/data/rtl/events/stream/loaddate=201511*")
It gives me this error:
java.io.FileNotFoundException: File does not exist: hdfs://nameservice1/data/rtl/events/stream/loaddate=201511*
How do I get it to work with a wildcard?
You can list the files or folders with the Hadoop FileSystem's listStatus, read each path you want into a DataFrame, and then merge them all into a single DataFrame with a reduce over unionAll.
Get the files/folders:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List everything under the parent directory
val fs = FileSystem.get(new Configuration())
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))
Read in the data:
val parquetFiles = status.map(folder => {
  sqlContext.read.parquet(folder.getPath.toString)
})
Merge the data into a single DataFrame:
val mergedFile = parquetFiles.reduce((x, y) => x.unionAll(y))
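Putting it all together, here is a minimal end-to-end sketch, assuming Spark 1.x with a sqlContext in scope; the parent path and the loaddate filter are illustrative, taken from the question's layout rather than anything I know about your cluster:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative parent directory holding the loaddate=... partition folders
val parent = "hdfs://nameservice1/data/rtl/events/stream"

val fs = FileSystem.get(new Configuration())

// Keep only the partition folders matching the range you want
val folders = fs.listStatus(new Path(parent))
  .filter(_.isDirectory)
  .map(_.getPath.toString)
  .filter(_.contains("loaddate=201511"))

// Read each folder and union everything into one DataFrame
val merged = folders
  .map(p => sqlContext.read.parquet(p))
  .reduce((x, y) => x.unionAll(y))

merged.count()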
You can also have a look at my past posts on the same topic:
Spark Scala list folders in directory
Spark/Scala flatten and flatMap is not working on DataFrame
If the provided paths are partition directories, set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, load them separately and then union them.
For example:
val basePath = "hdfs://nameservice1/data/rtl/events/stream"
sparkSession.read.option("basePath", basePath).parquet(basePath + "/loaddate=201511*")
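With basePath set, Spark discovers loaddate as a partition column, so instead of a glob you can also read the table root and filter on that column, letting Spark prune partitions for you. A minimal sketch, assuming Spark 2.x with a SparkSession named sparkSession; the cast guards against partition type inference reading loaddate as an integer:

val base = "hdfs://nameservice1/data/rtl/events/stream"
val df = sparkSession.read.option("basePath", base).parquet(base)
// loaddate comes back as a partition column; filtering on it prunes partitions
val nov2015 = df.filter(df("loaddate").cast("string").startsWith("201511"))
nov2015.count()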