I want to write a Spark 1.6 UDF which takes the following map:
case class MyRow(mapping: Map[(Int, Int), Double])
val data = Seq(
MyRow(Map((1, 1) -> 1.0))
)
val df = sc.parallelize(data).toDF()
df.printSchema()
root
 |-- mapping: map (nullable = true)
 |    |-- key: struct
 |    |-- value: double (valueContainsNull = false)
 |    |    |-- _1: integer (nullable = false)
 |    |    |-- _2: integer (nullable = false)
(As a side note: I find the above output strange, as the type of the key is printed below the type of the value. Why is that?)
Now I define my UDF as:
val myUDF = udf((inputMapping: Map[(Int,Int), Double]) =>
inputMapping.map { case ((i1, i2), value) => ((i1 + i2), value) }
)
df
.withColumn("udfResult", myUDF($"mapping"))
.show()
But this gives me:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
So I tried to replace (Int, Int) with a custom case class, because this is how I normally do it if I want to pass a struct to a UDF:
case class MyTuple2(i1: Int, i2: Int)
val myUDF = udf((inputMapping: Map[MyTuple2, Double]) =>
inputMapping.map { case (MyTuple2(i1, i2), value) => ((i1 + i2), value) }
)
This strangely gives:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(mapping)' due to data type mismatch: argument 1 requires map<struct<i1:int,i2:int>,double> type, however, 'mapping' is of map<struct<_1:int,_2:int>,double> type.
I don't understand the above exception, as the types match.
The only (ugly) solution I've found is passing an org.apache.spark.sql.Row and then "extracting" the elements of the struct:
import org.apache.spark.sql.Row

val myUDF = udf((inputMapping: Map[Row, Double]) => inputMapping
.map { case (key, value) => ((key.getInt(0), key.getInt(1)), value) } // extract Row into Tuple2
.map { case ((i1, i2), value) => ((i1 + i2), value) }
)
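With that in place, the original call goes through (a sketch of what I see, not verified output; for the sample entry (1, 1) -> 1.0 I'd expect the resulting map to contain 2 -> 1.0):
df
.withColumn("udfResult", myUDF($"mapping"))
.show()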
Solution:
As far as I know, there's no escaping the use of Row in this context: a tuple (or case class) used within a map (or another tuple/case class/array...) is a nested structure, and as such it would be represented as a Row when passed into a UDF.
The only improvement I can suggest is using Row.unapply to simplify the code a bit:
val myUDF = udf((inputMapping: Map[Row, Double]) => inputMapping
.map { case (Row(i1: Int, i2: Int), value) => (i1 + i2, value) }
)
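One thing to keep in mind: the Row(i1: Int, i2: Int) pattern is positional and its type checks happen at runtime, so a key that doesn't match (for example one containing a null field) fails with a MatchError instead of a ClassCastException. If you'd rather silently drop such keys than fail, collect with a partial function does the same transformation (a sketch under the same assumptions, untested):
val myUDF = udf((inputMapping: Map[Row, Double]) => inputMapping
.collect { case (Row(i1: Int, i2: Int), value) => (i1 + i2, value) }
)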