 

How to find the max value of multiple columns?

I am trying to find the maximum value across multiple columns in a Spark dataframe. Each of the value columns is of double type.

The dataframe is like:

+-----+---+----+---+---+
|Name | A | B  | C | D |
+-----+---+----+---+---+
|Alex |5.1|-6.2|  7|  8|
|John |  7| 8.3|  1|  2|
|Alice|  5|  46|  3|  2|
|Mark |-20| -11|-22| -5|
+-----+---+----+---+---+

The expectation is:

+-----+---+----+---+---+----------+
|Name | A | B  | C | D | MaxValue |
+-----+---+----+---+---+----------+
|Alex |5.1|-6.2|  7|  8|     8    |
|John |  7| 8.3|  1|  2|     8.3  | 
|Alice|  5|  46|  3|  2|     46   |
|Mark |-20| -11|-22| -5|     -5   |
+-----+---+----+---+---+----------+
asked Aug 16 '19 by user2967251

People also ask

How do I find the maximum value in multiple columns in SQL?

If you're working with MySQL, you can combine MAX() with the GREATEST() function to get the biggest value from two or more fields. The syntax is GREATEST(value1, value2, ...). Given two or more arguments, it returns the largest (maximum-valued) argument.
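Spark SQL also provides GREATEST, so the same idea carries over to the dataframe in the answer below via a plain SQL expression. A minimal sketch, assuming the df defined in the answer and an illustrative view name maxdemo:

df.createOrReplaceTempView("maxdemo")  // "maxdemo" is just an illustrative view name

spark.sql("SELECT *, greatest(A, B, C, D) AS MaxValue FROM maxdemo").show()
// Returns the original columns plus the row-wise maximum of A, B, C and D.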


1 Answer

You could apply greatest to the list of numeric columns, as shown below:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("Alex", 5.1, -6.2, 7.0, 8.0),
  ("John", 7.0, 8.3, 1.0, 2.0),
  ("Alice", 5.0, 46.0, 3.0, 2.0),
  ("Mark", -20.0, -11.0, -22.0, -5.0),
).toDF("Name", "A", "B", "C", "D")

val numCols = df.columns.tail  // Apply suitable filtering as needed (*)

df.withColumn("MaxValue", greatest(numCols.head, numCols.tail: _*)).
  show
// +-----+-----+-----+-----+----+--------+
// | Name|    A|    B|    C|   D|MaxValue|
// +-----+-----+-----+-----+----+--------+
// | Alex|  5.1| -6.2|  7.0| 8.0|     8.0|
// | John|  7.0|  8.3|  1.0| 2.0|     8.3|
// |Alice|  5.0| 46.0|  3.0| 2.0|    46.0|
// | Mark|-20.0|-11.0|-22.0|-5.0|    -5.0|
// +-----+-----+-----+-----+----+--------+

(*) For example, to filter for all top-level DoubleType columns:

import org.apache.spark.sql.types._

val numCols = df.schema.fields.collect{
  case StructField(name, DoubleType, _, _) => name
}
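// numCols: Array[String] = Array(A, B, C, D) for the sample dataframe above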

If you're on Spark 2.4+, an alternative would be to use array_max, although it involves an additional transformation step (assembling the columns into an array) in this case:

df.withColumn("MaxValue", array_max(array(numCols.map(col): _*)))
answered Sep 23 '22 by Leo C