Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to set partition for Window function for PySpark?

I'm running a PySpark job, and I'm getting the following message:

WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

What does the message indicate, and how do I define a partition for a Window operation?

EDIT:

I'm trying to rank on an entire column.

My data is organized as:

A
B
A
C
D

And I want:

A,1
B,3
A,1
C,4
D,5

I don't think there should by a .partitionBy() for this, only .orderBy(). The trouble is, this appears to cause performance degradation. Is there another way to achieve this without a Window function?

If I partition by the first column, the result would be:

A,1
B,1
A,1
C,1
D,1

Which I do not want.

like image 918
cshin9 Avatar asked Apr 05 '16 19:04

cshin9


People also ask

How do I partition in PySpark?

PySpark supports partition in two ways; partition in memory (DataFrame) and partition on the disk (File system). Partition in memory: You can partition or repartition the DataFrame by calling repartition() or coalesce() transformations.

How does window function work in PySpark?

PySpark Window function performs statistical operations such as rank, row number, etc. on a group, frame, or collection of rows and returns results for each row individually. It is also popularly growing to perform data transformations.

How do I increase partition in PySpark?

If you want to increase the partitions of your DataFrame, all you need to run is the repartition() function. Returns a new DataFrame partitioned by the given partitioning expressions.


1 Answers

Given the information given to the question, at best I can provide a skeleton on how partitions should be defined on Window functions :

from pyspark.sql.window import Window

windowSpec = \
     Window \
     .partitionBy(...) \ # Here is where you define partitioning
     .orderBy(…)

This is equivalent to the following SQL :

OVER (PARTITION BY ... ORDER BY …)

So concerning partitioning specification :

It controls which rows will be in the same partition with the given row. You might want to make sure all rows having the same value for the partition column are collected to the same machine before ordering and calculating the frame.

If you don't give any partitioning specification, then all data must be collected to a single machine, thus the following error message :

WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
like image 82
eliasah Avatar answered Nov 03 '22 00:11

eliasah