 

Partitioning by multiple columns in Spark SQL

With Spark SQL's window functions, I need to partition by multiple columns to run my data queries, as follows:

val w = Window.partitionBy($"a").partitionBy($"b").rangeBetween(-100, 0)

I currently do not have a test environment (I am working on setting this up), but as a quick question: is this currently supported by Spark SQL's window functions, or will this not work?

asked Jun 13 '16 by Eric Staner

People also ask

How do I select multiple columns in Spark?

You can select one or more columns of a Spark DataFrame by passing the column names to the select() function. Since DataFrames are immutable, this returns a new DataFrame containing only the selected columns.
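For illustration, a minimal Scala sketch, assuming a DataFrame df with columns a, b, and c already in scope:

import org.apache.spark.sql.functions.col

// Select multiple columns by name; returns a new, immutable DataFrame.
val subset = df.select("a", "b")

// Equivalent form using Column expressions.
val subset2 = df.select(col("a"), col("b"))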

How do I partition a DataFrame in Spark?

Partition in memory: you can repartition a DataFrame by calling the repartition() or coalesce() transformations. Partition on disk: when writing a DataFrame back to disk, you can choose how to partition the data by column values using partitionBy() on pyspark.sql.DataFrameWriter.
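A sketch of both forms in Scala, assuming df is in scope and /tmp/output is a writable path of your choosing:

import org.apache.spark.sql.functions.col

// Partition in memory: redistribute rows across 8 partitions, keyed by column "a".
val repartitioned = df.repartition(8, col("a"))

// Reduce the number of partitions without a full shuffle.
val coalesced = repartitioned.coalesce(4)

// Partition on disk: one sub-directory per distinct value of "a".
df.write.partitionBy("a").parquet("/tmp/output")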

What is partition by in Spark SQL?

When writing a DataFrame to a disk or file system, the partitionBy() method partitions the output based on column values: Spark divides the records by the partition column and writes each partition's data into its own sub-directory.
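A sketch of the resulting directory layout, assuming a hypothetical column country with values US and DE:

import org.apache.spark.sql.functions.col

// Writing with partitionBy produces one sub-directory per column value:
df.write.partitionBy("country").parquet("/tmp/by_country")
//   /tmp/by_country/country=US/part-*.parquet
//   /tmp/by_country/country=DE/part-*.parquet

// Readers can then prune partitions by filtering on the partition column.
val usOnly = spark.read.parquet("/tmp/by_country").where(col("country") === "US")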

What are the types of partitioning in Spark?

Apache Spark supports two types of partitioning: "hash partitioning" and "range partitioning".
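Both are available directly on the DataFrame API (repartitionByRange requires Spark 2.3 or later); a minimal sketch, assuming a DataFrame df with a column a:

import org.apache.spark.sql.functions.col

// Hash partitioning: rows with equal keys land in the same partition.
val hashed = df.repartition(8, col("a"))

// Range partitioning: rows are split into sorted, contiguous key ranges.
val ranged = df.repartitionByRange(8, col("a"))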


1 Answer

This won't work. The second partitionBy will overwrite the first one. Both partition columns have to be specified in the same call:

val w = Window.partitionBy($"a", $"b").rangeBetween(-100, 0)
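As a usage note: in practice a range frame also needs an ordering column, or Spark will raise an analysis error when the window is used. A minimal sketch, assuming a SparkSession in scope and a DataFrame df with hypothetical columns a, b, ts, and value:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._  // enables the $"col" syntax

// The frame covers rows in the same (a, b) partition whose ts is within
// [current ts - 100, current ts].
val w = Window.partitionBy($"a", $"b").orderBy($"ts").rangeBetween(-100, 0)

val result = df.withColumn("running_sum", sum($"value").over(w))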
answered Oct 17 '22 by zero323