Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does array_contains accept columns for both arguments in SQL but not in Dataset API?

I've been reviewing questions and answers about array_contains (and isin) methods on StackOverflow and I still cannot answer the following question:

Why does array_contains in SQL accept columns (references) as its arguments while the standard function does not?

I can understand that the above question could easily be marked as "primarily opinion-based" or similar so let me rephrase it to the following:

How to use array_contains standard function so it accepts arguments (the values) from columns?

scala> spark.version
res0: String = 2.3.0

val codes = Seq(
  (Seq(1, 2, 3), 2),
  (Seq(1), 1),
  (Seq.empty[Int], 1),
  (Seq(2, 4, 6), 0)).toDF("codes", "cd")
scala> codes.show
+---------+---+
|    codes| cd|
+---------+---+
|[1, 2, 3]|  2|
|      [1]|  1|
|       []|  1|
|[2, 4, 6]|  0|
+---------+---+

// array_contains in SQL mode works with arguments being columns
val q = codes.where("array_contains(codes, cd)")
scala> q.show
+---------+---+
|    codes| cd|
+---------+---+
|[1, 2, 3]|  2|
|      [1]|  1|
+---------+---+

// array_contains standard function with Columns does NOT work. Why?!
// How to change it so it would work (without reverting to SQL expr)?
scala> val q = codes.where(array_contains($"codes", $"cd"))
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.ColumnName cd
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
  at org.apache.spark.sql.functions$.array_contains(functions.scala:2988)
  ... 49 elided
like image 444
Jacek Laskowski Avatar asked May 18 '18 13:05

Jacek Laskowski


1 Answers

Since the underlying function ArrayContains does take expr arguements you can always cheat a little bit.

scala> codes.where(new Column(ArrayContains($"codes".expr, $"cd".expr))).show
+---------+---+
|    codes| cd|
+---------+---+
|[1, 2, 3]|  2|
|      [1]|  1|
+---------+---+
**/

Like user9812147 said, the issue here is just that the SQL Parser is able to access the ArrayContains function directly. While it seems the direct function call forces the "values" portion to be treated as a Literal.

like image 73
RussS Avatar answered Sep 22 '22 14:09

RussS