argmax in Spark DataFrames: how to retrieve the row with the maximum value

Given a Spark DataFrame df, I want to find the maximum value in a certain numeric column 'values', and obtain the row(s) where that value was reached. I can of course do this:

# it doesn't matter whether I use Scala or Python,
# since I hope to get this done with the DataFrame API
import pyspark.sql.functions as F
max_value = df.select(F.max('values')).collect()[0][0]
df.filter(df.values == max_value).show()

but this is inefficient since it requires two passes through df.

pandas.Series/DataFrame and numpy.array have argmax/idxmax methods that do this efficiently (in one pass). So does standard Python (the built-in max function accepts a key parameter, so it can be used to find the element with the highest value).
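For reference, here is a quick sketch (toy data, not from the question) of those single-pass equivalents outside Spark:

import pandas as pd

pdf = pd.DataFrame({"name": ["a", "b", "c"], "values": [1, 10, 3]})
pdf.loc[pdf["values"].idxmax()]   # pandas: the row where 'values' is maximal

rows = [("a", 1), ("b", 10), ("c", 3)]
max(rows, key=lambda r: r[1])     # built-in max with a key: ('b', 10)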

What is the right approach in Spark? Note that I don't mind whether I get all the rows where the maximum value is achieved, or just some arbitrary (non-empty!) subset of those rows.

asked Aug 07 '16 by max

People also ask

How do you select top 5 rows in PySpark?

In Spark/PySpark you can use the show() action to get the top/first N (5, 10, 100, ...) rows of the DataFrame and display them on a console or in a log. There are also several Spark actions like take(), tail(), collect(), head(), and first() that return the top or last N rows as a list of Rows (Array[Row] in Scala).
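A hedged sketch of those calls against the question's df:

df.show(5)              # prints the first 5 rows to the console
top5 = df.take(5)       # first 5 rows as a list of Row objects
first_row = df.first()  # a single Row (or None for an empty DataFrame)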

How can I show more than 20 rows in Spark?

By default, Spark with Scala, Java, or Python (PySpark) fetches only 20 rows from DataFrame show(), not all rows, and column values are truncated to 20 characters. In order to fetch/display more than 20 rows and full column values from a Spark/PySpark DataFrame, you need to pass arguments to the show() method.
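For example (a sketch, using the question's df):

# show up to 50 rows and do not truncate column values to 20 characters
df.show(50, truncate=False)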

What is window function in Spark?

Window functions allow users of Spark SQL to calculate results such as the rank of a given row or a moving average over a range of input rows. They significantly improve the expressiveness of Spark's SQL and DataFrame APIs.
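As an illustration (my own sketch, not part of the answers below), a window ranking can also solve the original question while keeping every row that ties for the maximum of 'values' (Spark may warn that a window with no partitioning moves all data to a single partition):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# order the whole DataFrame by 'values' descending; rank 1 marks the maximum rows
w = Window.orderBy(F.desc("values"))
df.withColumn("rnk", F.rank().over(w)).filter("rnk = 1").drop("rnk").show()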


2 Answers

If the schema is orderable (it contains only atomics, arrays of atomics, or recursively orderable structs), you can use a simple aggregation:

Python:

df.select(F.max(
    F.struct("values", *(x for x in df.columns if x != "values"))
)).first()

Scala:

df.select(max(struct(
    $"values" +: df.columns.collect { case x if x != "values" => col(x) }: _*
))).first
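The struct aggregation returns a single Row wrapping a nested Row; a small Python sketch (toy DataFrame and an added alias, both my assumptions) of reading the original columns back out:

import pyspark.sql.functions as F

df = spark.createDataFrame([("a", 1), ("b", 10), ("c", 3)], ["name", "values"])

row = df.select(F.max(
    F.struct("values", *(x for x in df.columns if x != "values"))
).alias("argmax")).first()

row["argmax"]           # Row(values=10, name='b')
row["argmax"]["name"]   # 'b'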

Otherwise you can reduce over Dataset (Scala only) but it requires additional deserialization:

type T = ???  // placeholder for the actual type of the 'values' column

df.reduce((a, b) => if (a.getAs[T]("values") > b.getAs[T]("values")) a else b)

You can also orderBy and limit(1) / take(1):

Scala:

df.orderBy(desc("values")).limit(1)
// or
df.orderBy(desc("values")).take(1)

Python:

df.orderBy(F.desc('values')).limit(1)
# or
df.orderBy(F.desc("values")).take(1)
answered Oct 19 '22 by zero323


Maybe it's an incomplete answer, but you can use the DataFrame's underlying RDD, apply the max method, and get the maximum record using a given key.

a = sc.parallelize([
    ("a", 1, 100),
    ("b", 2, 120),
    ("c", 10, 1000),
    ("d", 14, 1000)
  ]).toDF(["name", "id", "salary"])

a.rdd.max(key=lambda x: x["salary"]) # Row(name=u'c', id=10, salary=1000)
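If you need to continue with the DataFrame API afterwards, a small follow-up sketch (assuming a SparkSession named spark) that wraps the returned Row back into a one-row DataFrame:

best = a.rdd.max(key=lambda x: x["salary"])

# rebuild a single-row DataFrame from the Row, reusing the original schema
spark.createDataFrame([best], schema=a.schema).show()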
answered Oct 19 '22 by Alberto Bonsanto