My question is similar to this thread: Partitioning by multiple columns in Spark SQL
but I'm working in PySpark rather than Scala, and I want to pass my columns in as a list. I want to do something like this:
column_list = ["col1","col2"]
win_spec = Window.partitionBy(column_list)
I can get the following to work:
win_spec = Window.partitionBy(col("col1"))
This also works:
col_name = "col1"
win_spec = Window.partitionBy(col(col_name))
And this also works:
win_spec = Window.partitionBy([col("col1"), col("col2")])
You can also partition on multiple columns with PySpark's DataFrameWriter.partitionBy(); just pass the columns you want to partition by as arguments to that method. Note that this is a different method from Window.partitionBy(): the writer's partitionBy() partitions by column values while writing a DataFrame to disk, splitting the records on the partition columns and storing each partition's data in its own sub-directory.
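As a rough sketch of that disk-level behaviour (the output path /tmp/partitioned_out is just a placeholder, and df is assumed to be an existing DataFrame with col1 and col2 columns):

# writes one sub-directory per distinct (col1, col2) combination,
# e.g. /tmp/partitioned_out/col1=a/col2=apple/part-*.parquet
df.write.partitionBy("col1", "col2").mode("overwrite").parquet("/tmp/partitioned_out")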
Convert the column names to column expressions with a list comprehension, [col(x) for x in column_list]:
from pyspark.sql import Window
from pyspark.sql.functions import col

column_list = ["col1", "col2"]
win_spec = Window.partitionBy([col(x) for x in column_list])
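For example, here is a minimal, self-contained sketch of that window spec in use (the sample DataFrame and the ordering column are just placeholders for illustration):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("x", "y", 1), ("x", "y", 2), ("x", "z", 3)],
    ["col1", "col2", "value"]
)

column_list = ["col1", "col2"]
win_spec = Window.partitionBy([col(x) for x in column_list]).orderBy("value")

# number the rows within each (col1, col2) partition
df.withColumn("rn", row_number().over(win_spec)).show()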
Your first attempt should work: Window.partitionBy() accepts a list of column names directly.
Consider the following example:
import pyspark.sql.functions as f
from pyspark.sql import Window

df = spark.createDataFrame(
    [
        ("a", "apple", 1),
        ("a", "orange", 2),
        ("a", "orange", 3),
        ("b", "orange", 3),
        ("b", "orange", 5)
    ],
    ["name", "fruit", "value"]
)
df.show()
#+----+------+-----+
#|name| fruit|value|
#+----+------+-----+
#| a| apple| 1|
#| a|orange| 2|
#| a|orange| 3|
#| b|orange| 3|
#| b|orange| 5|
#+----+------+-----+
Suppose you wanted to calculate a fraction of the sum for each row, grouping by the first two columns:
cols = ["name", "fruit"]
w = Window.partitionBy(cols)
df.select(cols + [(f.col('value') / f.sum('value').over(w)).alias('fraction')]).show()
#+----+------+--------+
#|name| fruit|fraction|
#+----+------+--------+
#| a| apple| 1.0|
#| b|orange| 0.375|
#| b|orange| 0.625|
#| a|orange| 0.6|
#| a|orange| 0.4|
#+----+------+--------+
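If you prefer to keep plain string names, you can also unpack the list when building the window; assuming the same cols list as above, this is equivalent:

w = Window.partitionBy(*cols)  # same as Window.partitionBy("name", "fruit")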