
spark "basePath" option setting

When I do:

allf = spark.read.parquet("gs://bucket/folder/*")

I get:

java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:

... And the following message after the list of paths:

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.

I am new to Spark. I believe my data source is really a collection of "folders" (something like base/top_folder/year=x/month=y/*.parquet) and I would like to load all the files and transform them.

Thanks for your help!

  • UPDATE 1: I've looked at the Dataproc console and there is no way to set "options" when creating a cluster.
  • UPDATE 2: I've checked the cluster's "cluster.properties" file and there is no such option. Could it be I must add one and reset the cluster?
asked Nov 15 '16 by jldupont

1 Answer

Per Spark documentation on Parquet partition discovery, I believe that changing your load statement from

allf = spark.read.parquet("gs://bucket/folder/*")

to

allf = spark.read.parquet("gs://bucket/folder")

should discover and load all parquet partitions. This is assuming that the data was written with "folder" as its base directory.

If the directory base/folder actually contains multiple datasets, you will want to load each dataset independently and then union them together.

answered Oct 16 '22 by Angus Davis