Comparing columns in PySpark

I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n) and my task is to compute, for each row, the maximum value across those columns.

For example:

Input: PySpark DataFrame containing :

col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]

Output:

col_4 = max(col_1, col_2, col_3) = [3,2,5]

There is something similar in pandas as explained in this question.

Is there any way of doing this in PySpark, or should I convert my PySpark DataFrame to a pandas DataFrame and then perform the operations?

asked Jun 07 '16 by Hemant




3 Answers

You can reduce using SQL expressions over a list of columns:

from functools import reduce
from pyspark.sql.functions import col, when

def row_max(*cols):
    # Fold a when/otherwise comparison over the columns to get the row-wise maximum
    return reduce(
        lambda x, y: when(x > y, x).otherwise(y),
        [col(c) if isinstance(c, str) else c for c in cols]
    )

df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
    .toDF(["a", "b", "c"]))

df.select(row_max("a", "b", "c").alias("max"))
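
Calling show() on this for the sample DataFrame above should give the row-wise maximum (a sketch of the expected output):

df.select(row_max("a", "b", "c").alias("max")).show()

+---+
|max|
+---+
|  3|
|  2|
|  5|
+---+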

Spark 1.5+ also provides least and greatest:

from pyspark.sql.functions import greatest

df.select(greatest("a", "b", "c"))

If you want to keep the name of the column holding the max, you can use structs:

from pyspark.sql.functions import struct, lit

def row_max_with_name(*cols):
    cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
    return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))

maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
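
For the sample DataFrame, maxs holds a struct of the winning value and the name of the column it came from; ties on the value fall back to comparing the column name, since structs compare field by field. A sketch of the expected contents (the struct rendering in show() differs between Spark versions):

maxs.show()

+------+
|  maxs|
+------+
|[3, c]|
|[2, c]|
|[5, c]|
+------+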

And finally you can use the above to select the most frequent "top" column:

from pyspark.sql.functions import max

# count how often each column wins, then take the name of the most frequent winner
((_, c), ) = (maxs
    .groupBy(col("maxs")["col"].alias("col"))
    .count()
    .agg(max(struct(col("count"), col("col"))))
    .first())

df.select(c)
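
For the sample data, column c wins in every row, so the aggregation resolves c to the string "c" and df.select(c) simply returns that column (a sketch of the expected result):

df.select(c).show()

+---+
|  c|
+---+
|  3|
|  2|
|  5|
+---+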
answered by zero323


We can use greatest

Creating DataFrame

df = spark.createDataFrame(
    [[1,2,3], [2,1,2], [3,4,5]], 
    ['col_1','col_2','col_3']
)
df.show()
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
|    1|    2|    3|
|    2|    1|    2|
|    3|    4|    5|
+-----+-----+-----+

Solution

from pyspark.sql.functions import greatest
df2 = df.withColumn('max_by_rows', greatest('col_1', 'col_2', 'col_3'))

# Equivalent, passing Column objects explicitly:
#from pyspark.sql.functions import col
#df2 = df.withColumn('max_by_rows', greatest(col('col_1'), col('col_2'), col('col_3')))
df2.show()

+-----+-----+-----+-----------+
|col_1|col_2|col_3|max_by_rows|
+-----+-----+-----+-----------+
|    1|    2|    3|          3|
|    2|    1|    2|          2|
|    3|    4|    5|          5|
+-----+-----+-----+-----------+
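
If the m columns are only known at runtime, greatest also accepts an unpacked list of names (a small sketch; cols_to_compare is a hypothetical name for that list, and greatest needs at least two columns):

from pyspark.sql.functions import greatest

cols_to_compare = ['col_1', 'col_2', 'col_3']  # hypothetical: the m column names to compare
df2 = df.withColumn('max_by_rows', greatest(*cols_to_compare))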
answered by ansev


You can also use the PySpark built-in least (for the row-wise minimum):

from pyspark.sql.functions import least, col
df = df.withColumn('min', least(col('c1'), col('c2'), col('c3')))
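
Applied to the DataFrame from the previous answer (a sketch, with the column names adjusted to col_1, col_2, col_3), this gives the row-wise minimum:

from pyspark.sql.functions import least

df2 = df.withColumn('min_by_rows', least('col_1', 'col_2', 'col_3'))
df2.show()

+-----+-----+-----+-----------+
|col_1|col_2|col_3|min_by_rows|
+-----+-----+-----+-----------+
|    1|    2|    3|          1|
|    2|    1|    2|          1|
|    3|    4|    5|          3|
+-----+-----+-----+-----------+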
answered by mattexx