how to get the name of column with maximum value in pyspark dataframe

Tags: python, dataframe, pyspark

How do we get, for each row of a PySpark DataFrame, the name of the column that holds the maximum value?

   Alice  Eleonora  Mike  Helen       MAX
0      2         7     8      6      Mike
1     11         5     9      4     Alice
2      6        15    12      3  Eleonora
3      5         3     7      8     Helen

I need something like this: the names of the columns, not the max values. I am able to get the max values; what I need is the column name.

asked Oct 18 '17 21:10

Vikas Bishnoi

1 Answer

You can chain conditions to find which column is equal to the row maximum:

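from functools import reduce
import pyspark.sql.functions as psf

# Chain when(...).when(...) without eval; the first column equal to the row maximum wins.
is_max = lambda c: (psf.col(c) == psf.col("max_value"), psf.lit(c))
cond = reduce(lambda acc, c: acc.when(*is_max(c)), df.columns[1:], psf.when(*is_max(df.columns[0])))
df.withColumn("max_value", psf.greatest(*df.columns))\
    .withColumn("MAX", cond)\
    .show()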

    +-----+--------+----+-----+---------+--------+
    |Alice|Eleonora|Mike|Helen|max_value|     MAX|
    +-----+--------+----+-----+---------+--------+
    |    2|       7|   8|    6|        8|    Mike|
    |   11|       5|   9|    4|       11|   Alice|
    |    6|      15|  12|    3|       15|Eleonora|
    |    5|       3|   7|    8|        8|   Helen|
    +-----+--------+----+-----+---------+--------+
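To make the behaviour of the chained conditions concrete, here is a minimal plain-Python sketch of what the chain evaluates to for a single row. No Spark is involved, the helper name `first_max_column` is hypothetical, and the row is assumed to contain no nulls. It also shows what happens on ties: the first matching column, in `df.columns` order, wins.

```python
def first_max_column(columns, row):
    """Mimic when(c1 == max, 'c1').when(c2 == max, 'c2')... for one row."""
    max_value = max(row)  # plays the role of psf.greatest(*df.columns)
    for name, value in zip(columns, row):
        if value == max_value:  # first true condition wins, as in a chained when
            return name
    return None  # a chained when without otherwise() yields null if nothing matches

columns = ["Alice", "Eleonora", "Mike", "Helen"]
rows = [(2, 7, 8, 6), (11, 5, 9, 4), (6, 15, 12, 3), (5, 3, 7, 8)]

result = [first_max_column(columns, r) for r in rows]
print(result)  # ['Mike', 'Alice', 'Eleonora', 'Helen']

tie = first_max_column(columns, (9, 9, 1, 1))
print(tie)  # Alice
```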

OR: explode and filter

from itertools import chain
df.withColumn("max_value", psf.greatest(*df.columns))\
    .select("*", psf.posexplode(psf.create_map(list(chain(*[(psf.lit(c), psf.col(c)) for c in df.columns])))))\
    .filter("max_value = value")\
    .select(df.columns + [psf.col("key").alias("MAX")])\
    .show()
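One difference worth knowing about the explode-and-filter variant: every (key, value) pair becomes its own row before the filter runs, so a row whose maximum appears in several columns comes out once per tied column. The plain-Python sketch below imitates that step on tuples; `explode_and_filter` is a hypothetical helper, not a Spark function.

```python
def explode_and_filter(columns, rows):
    """Imitate exploding a name->value map, then keeping rows where value == max."""
    out = []
    for row in rows:
        max_value = max(row)
        for key, value in zip(columns, row):  # one exploded row per column
            if value == max_value:            # the filter keeps only the maxima
                out.append(row + (key,))
    return out

columns = ["Alice", "Eleonora", "Mike", "Helen"]

no_ties = explode_and_filter(columns, [(2, 7, 8, 6), (5, 3, 7, 8)])
print(no_ties)   # [(2, 7, 8, 6, 'Mike'), (5, 3, 7, 8, 'Helen')]

with_tie = explode_and_filter(columns, [(9, 1, 9, 4)])
print(with_tie)  # [(9, 1, 9, 4, 'Alice'), (9, 1, 9, 4, 'Mike')]
```

If exactly one output row per input row is required, one of the other variants is probably the safer choice.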

OR: using a UDF on a dictionary:

from pyspark.sql.types import StringType
argmax_udf = psf.udf(lambda m: max(m, key=m.get), StringType())
df.withColumn("map", psf.create_map(list(chain(*[(psf.lit(c), psf.col(c)) for c in df.columns]))))\
    .withColumn("MAX", argmax_udf("map"))\
    .drop("map")\
    .show()
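The lambda above assumes every value in the map is non-null. In Python 3, `max` raises a `TypeError` when it has to compare `None` with a number, so a row containing a null would make the UDF fail. Below is a minimal sketch of a null-tolerant function that could be passed to `psf.udf(..., StringType())` instead; the name `argmax_ignoring_none` is hypothetical, and the plain dicts stand in for the map the UDF receives.

```python
def argmax_ignoring_none(m):
    """Return the key with the largest non-None value, or None if there is none."""
    candidates = [k for k, v in m.items() if v is not None]
    if not candidates:
        return None
    return max(candidates, key=m.get)

a = argmax_ignoring_none({"Alice": 2, "Eleonora": 7, "Mike": 8, "Helen": 6})
b = argmax_ignoring_none({"Alice": None, "Eleonora": 5, "Mike": 9, "Helen": 4})
c = argmax_ignoring_none({"Alice": None, "Eleonora": None})
print(a, b, c)  # Mike Mike None
```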

OR: using a UDF with a parameter:

from pyspark.sql.types import StringType
def argmax(cols, *args):
    # Return the first column whose value equals the row maximum.
    max_value = max(args)
    return next(c for c, v in zip(cols, args) if v == max_value)
argmax_udf = lambda cols: psf.udf(lambda *args: argmax(cols, *args), StringType())
df.withColumn("MAX", argmax_udf(df.columns)(*df.columns))\
    .show()

answered Sep 27 '22 22:09

MaFF
