PySpark Window function on entire data frame

Consider a PySpark data frame. I would like to summarize the entire data frame, per column, and append the result for every row.

+-----+----------+-----------+
|index|      col1|       col2|
+-----+----------+-----------+
|  0.0|0.58734024|0.085703015|
|  1.0|0.67304325| 0.17850411|
+-----+----------+-----------+

Expected result

+-----+----------+-----------+--------+---------+--------+---------+
|index|      col1|       col2|col1_min|col1_mean|col2_min|col2_mean|
+-----+----------+-----------+--------+---------+--------+---------+
|  0.0|0.58734024|0.085703015|      -5|      2.3|      -2|      1.4|
|  1.0|0.67304325| 0.17850411|      -5|      2.3|      -2|      1.4|
+-----+----------+-----------+--------+---------+--------+---------+

To my knowledge, I'll need a Window function with the whole data frame as the window, so that the result is kept on every row (instead of, for example, computing the stats separately and then joining them back to replicate them on each row).

My questions are:

  1. How do I write a Window without any partitionBy or orderBy?

    I know the standard Window with partition and order, but not one that takes everything as a single partition:

    w = Window.partitionBy("col1", "col2").orderBy(desc("col1"))
    df = df.withColumn("col1_mean", mean("col1").over(w))
    

    How would I write a Window that treats everything as one partition?

  2. Is there any way to write this dynamically for all columns?

    Let's say I have 500 columns; it does not look great to write this out repeatedly.

    df = (df
        .withColumn("col1_mean", mean("col1").over(w))
        .withColumn("col1_min", min("col1").over(w))
        .withColumn("col2_mean", mean("col2").over(w))
        .....
    )
    

    Let's assume I want multiple stats for each column, so each colx would spawn colx_min, colx_max and colx_mean (see the sketch right after this list).
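
For reference, here is a minimal sketch of the window-based version described above. It is only an illustration under assumptions (it reuses the df from the question; the stat_funcs mapping is made up for this example): an empty partitionBy() puts every row into one window partition, and the per-column stats can be generated in a loop instead of being written out by hand.

from pyspark.sql import Window
import pyspark.sql.functions as F

# An empty partitionBy() treats the whole data frame as a single window
# partition. Spark will log a warning about moving all data to one
# partition, so this does not scale to large data.
w = Window.partitionBy()

# Hypothetical mapping of column-name suffixes to aggregate functions.
stat_funcs = {"min": F.min, "max": F.max, "mean": F.mean}

# Generate colX_min / colX_max / colX_mean for every "col*" column.
for c in [name for name in df.columns if name.startswith("col")]:
    for stat, fn in stat_funcs.items():
        df = df.withColumn(c + "_" + stat, fn(c).over(w))

With hundreds of columns, chaining withColumn calls like this also makes the query plan very large; building a single select with all the expressions at once, or using the cross-join approach in the answer below, tends to scale better.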

Kenny asked Feb 26 '20 16:02



1 Answer

Instead of using a window, you can achieve the same result with a custom aggregation combined with a cross join:

import pyspark.sql.functions as F
from pyspark.sql.functions import broadcast
from itertools import chain

df = spark.createDataFrame([
  [1, 2.3, 1],
  [2, 5.3, 2],
  [3, 2.1, 4],
  [4, 1.5, 5]
], ["index", "col1", "col2"])

# Build (min, max, mean) expressions for every column whose name starts with "col"
agg_cols = [(F.min(c).alias("min_" + c),
             F.max(c).alias("max_" + c),
             F.mean(c).alias("mean_" + c))
            for c in df.columns if c.startswith('col')]

# Flatten the tuples into one argument list and aggregate into a single-row stats data frame
stats_df = df.agg(*list(chain(*agg_cols)))

# There is no performance impact from the crossJoin: the right-hand table has
# only one row, which we broadcast (most likely Spark would broadcast it anyway)
df.crossJoin(broadcast(stats_df)).show() 

# +-----+----+----+--------+--------+---------+--------+--------+---------+
# |index|col1|col2|min_col1|max_col1|mean_col1|min_col2|max_col2|mean_col2|
# +-----+----+----+--------+--------+---------+--------+--------+---------+
# |    1| 2.3|   1|     1.5|     5.3|      2.8|       1|       5|      3.0|
# |    2| 5.3|   2|     1.5|     5.3|      2.8|       1|       5|      3.0|
# |    3| 2.1|   4|     1.5|     5.3|      2.8|       1|       5|      3.0|
# |    4| 1.5|   5|     1.5|     5.3|      2.8|       1|       5|      3.0|
# +-----+----+----+--------+--------+---------+--------+--------+---------+

Note 1: Using broadcast we avoid shuffling, since the broadcasted df will be sent to all the executors.

Note 2: with chain(*agg_cols) we flatten the list of tuples created in the previous step.
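
To make Note 2 concrete, here is a tiny standalone illustration of the flattening step (pairs and flat are names made up for this example only):

from itertools import chain

pairs = [(1, 2, 3), (4, 5, 6)]   # a list of tuples, like agg_cols
flat = list(chain(*pairs))       # [1, 2, 3, 4, 5, 6] -- ready to be unpacked into df.agg(*flat)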

UPDATE:

Here is the execution plan for the above program:

== Physical Plan ==
*(3) BroadcastNestedLoopJoin BuildRight, Cross
:- *(3) Scan ExistingRDD[index#196L,col1#197,col2#198L]
+- BroadcastExchange IdentityBroadcastMode, [id=#274]
   +- *(2) HashAggregate(keys=[], functions=[finalmerge_min(merge min#233) AS min(col1#197)#202, finalmerge_max(merge max#235) AS max(col1#197)#204, finalmerge_avg(merge sum#238, count#239L) AS avg(col1#197)#206, finalmerge_min(merge min#241L) AS min(col2#198L)#208L, finalmerge_max(merge max#243L) AS max(col2#198L)#210L, finalmerge_avg(merge sum#246, count#247L) AS avg(col2#198L)#212])
      +- Exchange SinglePartition, [id=#270]
         +- *(1) HashAggregate(keys=[], functions=[partial_min(col1#197) AS min#233, partial_max(col1#197) AS max#235, partial_avg(col1#197) AS (sum#238, count#239L), partial_min(col2#198L) AS min#241L, partial_max(col2#198L) AS max#243L, partial_avg(col2#198L) AS (sum#246, count#247L)])
            +- *(1) Project [col1#197, col2#198L]
               +- *(1) Scan ExistingRDD[index#196L,col1#197,col2#198L]

Here we see a BroadcastExchange of a SinglePartition broadcasting one single row, since stats_df fits into a single partition. Therefore the only data moved around is that one row (the minimum possible).
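
If you want to reproduce this plan, it can be printed with explain() on the joined data frame (reusing df and stats_df from the snippet above; the generated ids in the output will differ between runs):

# Prints the physical plan; explain(True) additionally shows the parsed,
# analyzed and optimized logical plans.
df.crossJoin(broadcast(stats_df)).explain()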

abiratsis answered Oct 10 '22 12:10