
Spark Dataframe Nested Case When Statement

I need to implement the SQL logic below on a Spark DataFrame:

SELECT KEY,
    CASE WHEN tc in ('a','b') THEN 'Y'
         WHEN tc in ('a') AND amt > 0 THEN 'N'
         ELSE NULL END REASON
FROM dataset1;

My input DataFrame is as below:

val dataset1 = Seq((66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")).toDF("KEY", "tc", "amt")

dataset1.show()
+---+---+---+
|KEY| tc|amt|
+---+---+---+
| 66|  a|  4|
| 67|  a|  0|
| 70|  b|  4|
| 71|  d|  4|
+---+---+---+

I have implemented the nested case when statement as:

dataset1.withColumn("REASON", when(col("tc").isin("a", "b"), "Y")
  .otherwise(when(col("tc").equalTo("a") && col("amt").geq(0), "N")
    .otherwise(null))).show()
+---+---+---+------+
|KEY| tc|amt|REASON|
+---+---+---+------+
| 66|  a|  4|     Y|
| 67|  a|  0|     Y|
| 70|  b|  4|     Y|
| 71|  d|  4|  null|
+---+---+---+------+

The readability of the above logic, with its nested "otherwise" calls, gets messy as the when statements nest deeper.

Is there any better way of implementing nested case when statements in Spark DataFrames?

— asked by RaAm on Oct 09 '17




2 Answers

There is no real nesting here, so there is no need for otherwise. All you need is a chained when:

import org.apache.spark.sql.functions.{lit, when}
import spark.implicits._

when($"tc" isin ("a", "b"), "Y")
  .when($"tc" === "a" && $"amt" >= 0, "N")

ELSE NULL is implicit so you can omit it completely.
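
For instance, applied to the example DataFrame from the question, the chained version reproduces the output shown above (a minimal sketch, reusing the dataset1 defined in the question):

dataset1.withColumn("REASON",
    when($"tc" isin ("a", "b"), "Y")
      .when($"tc" === "a" && $"amt" >= 0, "N"))
  .show()
+---+---+---+------+
|KEY| tc|amt|REASON|
+---+---+---+------+
| 66|  a|  4|     Y|
| 67|  a|  0|     Y|
| 70|  b|  4|     Y|
| 71|  d|  4|  null|
+---+---+---+------+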

The pattern you use is more applicable for folding over a data structure:

val cases = Seq(
  ($"tc" isin ("a", "b"), "Y"),
  ($"tc" === "a" && $"amt" >= 0, "N")
)

where when-otherwise naturally follows the recursion pattern and null provides the base case:

cases.foldLeft(lit(null)) {
  case (acc, (expr, value)) => when(expr, value).otherwise(acc)
}
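
Note that foldLeft wraps each accumulated expression in otherwise, so later entries in cases take precedence over earlier ones; reverse the list (or use foldRight) if the listed order should be checked first. A minimal sketch of wiring the folded expression into the DataFrame (the name reasonCol is just illustrative):

// Bind the folded expression and apply it as a regular Column.
val reasonCol = cases.foldLeft(lit(null).cast("string")) {
  case (acc, (cond, value)) => when(cond, value).otherwise(acc)
}

dataset1.withColumn("REASON", reasonCol).show()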

Please note that it is impossible to reach the "N" outcome with this chain of conditions: if tc equals "a", it is captured by the first clause, and if it does not, it fails both predicates and defaults to NULL. You should instead write:

when($"tc" === "a" && $"amt" >= 0, "N")
 .when($"tc" isin ("a", "b"), "Y")
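
With this order the amt check runs before the catch-all "Y" branch; applied to the example dataset1 it should produce (a sketch, derived from the data above):

dataset1.withColumn("REASON",
    when($"tc" === "a" && $"amt" >= 0, "N")
      .when($"tc" isin ("a", "b"), "Y"))
  .show()
+---+---+---+------+
|KEY| tc|amt|REASON|
+---+---+---+------+
| 66|  a|  4|     N|
| 67|  a|  0|     N|
| 70|  b|  4|     Y|
| 71|  d|  4|  null|
+---+---+---+------+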
— answered by zero323 on Sep 18 '22


For more complex logic, I prefer to use UDFs for better readability:

import org.apache.spark.sql.functions.udf

// Note: as in the question, the first branch already captures tc == "a",
// so the "N" branch is unreachable unless the conditions are reordered.
val selectCase = udf((tc: String, amt: String) =>
  if (Seq("a", "b").contains(tc)) "Y"
  else if (tc == "a" && amt.toInt <= 0) "N"
  else null
)


dataset1.withColumn("REASON", selectCase(col("tc"), col("amt")))
  .show
— answered by Raphael Roth on Sep 22 '22