
How to save bucketed DataFrame?

I am trying to save a DataFrame using bucketBy:

df.write.bucketBy(42, "column").format("parquet").save()

But this produces the error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;

Is there any other way to save the result of bucketBy?

1 Answer

As of Spark 2.1, save does not support bucketing, as noted in the error message.

The method bucketBy buckets the output by the given columns and, when specified, the output is laid out on the file system in a way similar to Hive's bucketing scheme.

There is a JIRA in progress for Hive bucketing support [SPARK-19256].

So the only available operation after bucketing is saveAsTable, which saves the content of the DataFrame/Dataset as the specified table.

And since Spark mainly connects to Hive here, you are actually saving the bucketed table to Hive.
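For example, a minimal sketch of the saveAsTable route (the table name my_bucketed_table, the bucket count 42, and the input path are placeholders; enableHiveSupport is only needed if the table should live in the Hive metastore):

import org.apache.spark.sql.SparkSession

// Minimal sketch; table name, bucket count and input path are placeholders.
val spark = SparkSession.builder()
  .appName("bucketBy-example")
  .enableHiveSupport()   // only needed if the table should go to the Hive metastore
  .getOrCreate()

val df = spark.read.parquet("/path/to/input")

// Bucket by "column" into 42 buckets and persist as a table instead of calling save().
df.write
  .bucketBy(42, "column")
  .sortBy("column")        // optional, but commonly paired with bucketBy
  .format("parquet")
  .saveAsTable("my_bucketed_table")

The bucketed table can then be read back with spark.table("my_bucketed_table").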

So what you are trying to do isn't possible with save in Spark for the time being.
