
spark "basePath" option setting

When I do:

allf = spark.read.parquet("gs://bucket/folder/*")

I get:

java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:

... And the following message after the list of paths:

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.

I am new to Spark. I believe my data source is really a collection of "folders" (something like base/top_folder/year=x/month=y/*.parquet) and I would like to load all the files and transform them.

Thanks for your help!

  • UPDATE 1: I've looked at the Dataproc console and there is no way to set "options" when creating a cluster.
  • UPDATE 2: I've checked the cluster's "cluster.properties" file and there is no such option. Could it be I must add one and reset the cluster?
asked Nov 15 '16 by jldupont

1 Answer

Per Spark documentation on Parquet partition discovery, I believe that changing your load statement from

allf = spark.read.parquet("gs://bucket/folder/*")

to

allf = spark.read.parquet("gs://bucket/folder")

should discover and load all parquet partitions. This is assuming that the data was written with "folder" as its base directory.

If the directory base/folder actually contains multiple datasets, you will want to load each dataset independently and then union them together.

answered Oct 16 '22 by Angus Davis