In Scala/Spark, given the following dataframe:
val dfIn = sqlContext.createDataFrame(Seq(
  ("r0", 0, 2, 3),
  ("r1", 1, 0, 0),
  ("r2", 0, 2, 2)
)).toDF("id", "c0", "c1", "c2")
I would like to compute a new column maxCol
holding the name of the column corresponding to the max value (for each row). With this example, the output should be:
+---+---+---+---+------+
| id| c0| c1| c2|maxCol|
+---+---+---+---+------+
| r0| 0| 2| 3| c2|
| r1| 1| 0| 0| c0|
| r2| 0| 2| 2| c1|
+---+---+---+---+------+
The actual dataframe has more than 60 columns, so a generic solution is required.
The equivalent in Python Pandas (yes, I know, I should compare with pyspark...) could be:
dfOut = pd.concat([dfIn, dfIn.idxmax(axis=1).rename('maxCol')], axis=1)
With a small trick you can use the greatest function. Required imports:
import org.apache.spark.sql.functions.{col, greatest, lit, struct}
First, let's create a list of structs, where the first element is the value and the second one the column name:
val structs = dfIn.columns.tail.map(
  c => struct(col(c).as("v"), lit(c).as("k"))
)
A structure like this can be passed to greatest as follows:
dfIn.withColumn("maxCol", greatest(structs: _*).getItem("k"))
+---+---+---+---+------+
| id| c0| c1| c2|maxCol|
+---+---+---+---+------+
| r0| 0| 2| 3| c2|
| r1| 1| 0| 0| c0|
| r2| 0| 2| 2| c2|
+---+---+---+---+------+
Please note that in case of ties it will take the element which occurs later in the sequence, since lexicographically (x, "c2") > (x, "c1"). If for some reason this is not acceptable, you can explicitly reduce with when:
import org.apache.spark.sql.functions.when
val max_col = structs.reduce(
  (c1, c2) => when(c1.getItem("v") >= c2.getItem("v"), c1).otherwise(c2)
).getItem("k")
dfIn.withColumn("maxCol", max_col)
+---+---+---+---+------+
| id| c0| c1| c2|maxCol|
+---+---+---+---+------+
| r0| 0| 2| 3| c2|
| r1| 1| 0| 0| c0|
| r2| 0| 2| 2| c1|
+---+---+---+---+------+
In case of nullable columns you have to adjust this, for example by coalescing the values to -Inf.
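A minimal sketch of that adjustment, assuming the same dfIn as above (the name structsNullSafe is just illustrative), could be:
import org.apache.spark.sql.functions.{coalesce, col, greatest, lit, struct}

// Replace null with negative infinity before building the structs,
// so that a null value never wins the comparison in greatest.
val structsNullSafe = dfIn.columns.tail.map(
  c => struct(coalesce(col(c), lit(Double.NegativeInfinity)).as("v"), lit(c).as("k"))
)

dfIn.withColumn("maxCol", greatest(structsNullSafe: _*).getItem("k"))
Note that coalesce promotes the compared values to Double here; the same idea works with the when-based reduce as well.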