How to find the max value of multiple columns?

Tags: scala, apache-spark, apache-spark-sql

I am trying to find the row-wise maximum across multiple columns of a Spark DataFrame. Every column except Name is of type double.

The DataFrame looks like this:

+-----+---+----+---+---+
|Name | A | B  | C | D |
+-----+---+----+---+---+
|Alex |5.1|-6.2|  7|  8|
|John |  7| 8.3|  1|  2|
|Alice|  5|  46|  3|  2|
|Mark |-20| -11|-22| -5|
+-----+---+----+---+---+

The expected output is:

+-----+---+----+---+---+----------+
|Name | A | B  | C | D | MaxValue |
+-----+---+----+---+---+----------+
|Alex |5.1|-6.2|  7|  8|     8    |
|John |  7| 8.3|  1|  2|     8.3  | 
|Alice|  5|  46|  3|  2|     46   |
|Mark |-20| -11|-22| -5|     -5   |
+-----+---+----+---+---+----------+

asked Aug 16 '19 22:08

user2967251

1 Answer

You could apply greatest to the list of numeric columns, as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("Alex", 5.1, -6.2, 7.0, 8.0),
  ("John", 7.0, 8.3, 1.0, 2.0),
  ("Alice", 5.0, 46.0, 3.0, 2.0),
  ("Mark", -20.0, -11.0, -22.0, -5.0)
).toDF("Name", "A", "B", "C", "D")

// All columns except the leading "Name"; apply suitable filtering as needed (*)
val numCols = df.columns.tail

// greatest takes at least two columns and returns the row-wise maximum
df.withColumn("MaxValue", greatest(numCols.head, numCols.tail: _*))
  .show()
// +-----+-----+-----+-----+----+--------+
// | Name|    A|    B|    C|   D|MaxValue|
// +-----+-----+-----+-----+----+--------+
// | Alex|  5.1| -6.2|  7.0| 8.0|     8.0|
// | John|  7.0|  8.3|  1.0| 2.0|     8.3|
// |Alice|  5.0| 46.0|  3.0| 2.0|    46.0|
// | Mark|-20.0|-11.0|-22.0|-5.0|    -5.0|
// +-----+-----+-----+-----+----+--------+

(*) For example, to filter for all top-level DoubleType columns:

import org.apache.spark.sql.types._

val numCols = df.schema.fields.collect{
  case StructField(name, DoubleType, _, _) => name
}
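
If the numeric columns are not all doubles, one possible variation is to match on the abstract NumericType instead of DoubleType. The sketch below is an illustration rather than part of the original answer: the DataFrame `mixed` and its column names are hypothetical, and the local SparkSession settings are arbitrary.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{NumericType, StructField}

// Local session for the sketch; master and app name are arbitrary choices.
val spark = SparkSession.builder().master("local[1]").appName("numeric-cols").getOrCreate()
import spark.implicits._

// Hypothetical DataFrame mixing String, Int, Double and Long columns.
val mixed = Seq(
  ("Alex", 30, 5.1, 100L),
  ("John", 41, 8.3, 250L)
).toDF("Name", "Age", "Score", "Bonus")

// Keep every top-level column whose data type is a subtype of NumericType
// (IntegerType, LongType, DoubleType, DecimalType, ...).
val numericCols = mixed.schema.fields.collect {
  case StructField(name, _: NumericType, _, _) => name
}
```

Here `numericCols` contains "Age", "Score" and "Bonus", in schema order, and skips the string column "Name".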

On Spark 2.4+, an alternative is array_max, although in this case it requires the extra step of first packing the columns into an array:

df.withColumn("MaxValue", array_max(array(numCols.map(col): _*)))
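
Both approaches skip null values and return null only when every input in the row is null. The following minimal sketch illustrates this under stated assumptions: the DataFrame `withNulls`, its data, and the output column names are hypothetical, and the local SparkSession settings are arbitrary.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, array_max, col, greatest}

// Local session for the sketch; master and app name are arbitrary choices.
val spark = SparkSession.builder().master("local[1]").appName("row-max-nulls").getOrCreate()
import spark.implicits._

// Hypothetical data: Option[Double] cells become nullable double columns.
val withNulls = Seq(
  ("Alex", Option(5.1), Option.empty[Double], Option(8.0)),
  ("Mark", Option.empty[Double], Option.empty[Double], Option.empty[Double])
).toDF("Name", "A", "B", "C")

val valueCols = withNulls.columns.tail

val result = withNulls
  // Row-wise maximum via greatest (nulls are skipped)
  .withColumn("MaxGreatest", greatest(valueCols.head, valueCols.tail: _*))
  // Row-wise maximum via array_max over an array built from the same columns
  .withColumn("MaxArray", array_max(array(valueCols.map(col): _*)))

result.show()
```

For the "Alex" row both new columns hold 8.0, while for the all-null "Mark" row both are null. If a default is preferred over null, wrapping the expression in coalesce with a literal is one option.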

answered Sep 23 '22 00:09

Leo C
