Maybe it's just because I'm relatively new to the API, but I feel like Spark ML methods often return DFs that are unnecessarily difficult to work with.
This time, it's the ALS model that's tripping me up. Specifically, the recommendForAllUsers method.
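For context, the call in question looks roughly like this (a hypothetical setup, not my exact code; note that the userId/itemId column names configured here are also what name the struct fields in the result):

import org.apache.spark.ml.recommendation.ALS

// hypothetical: assumes a ratings DataFrame with userId, itemId and rating columns
val als = new ALS().
  setUserCol("userId").
  setItemCol("itemId").
  setRatingCol("rating")
val model = als.fit(ratings)
val recs = model.recommendForAllUsers(2) // top 2 items per user

Let's reconstruct the type of DF it would return: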
scala> import org.apache.spark.sql.types._
scala> val arrayType = ArrayType(new StructType().add("itemId", IntegerType).add("rating", FloatType))
scala> val recs = Seq((1, Array((1, .7), (2, .5))), (2, Array((0, .9), (4, .1)))).
         toDF("userId", "recommendations").
         select($"userId", $"recommendations".cast(arrayType)) // the cast renames _1/_2 to itemId/rating
scala> recs.show()
+------+------------------+
|userId|   recommendations|
+------+------------------+
|     1|[[1,0.7], [2,0.5]]|
|     2|[[0,0.9], [4,0.1]]|
+------+------------------+
scala> recs.printSchema
root
 |-- userId: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- itemId: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)
Now, I only care about the itemId in the recommendations column. After all, the method is recommendForAllUsers, not recommendAndScoreForAllUsers (ok, ok, I'll stop being sassy...). How do I do this??
I thought I had it when I created a UDF:
scala> val itemIds = udf((arr: Array[(Int, Float)]) => arr.map(_._1))
but that produces an error:
scala> recs.withColumn("items", itemIds($"recommendations"))
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(recommendations)' due to data type mismatch: argument 1 requires array<struct<_1:int,_2:float>> type, however, '`recommendations`' is of array<struct<itemId:int,rating:float>> type.;;
'Project [userId#87, recommendations#92, UDF(recommendations#92) AS items#238]
+- Project [userId#87, cast(recommendations#88 as array<struct<itemId:int,rating:float>>) AS recommendations#92]
+- Project [_1#84 AS userId#87, _2#85 AS recommendations#88]
+- LocalRelation [_1#84, _2#85]
Any ideas? thanks!
Wow, my coworker came up with an extremely elegant solution:
scala> recs.select($"userId", $"recommendations.itemId").show
+------+------+
|userId|itemId|
+------+------+
|     1|[1, 2]|
|     2|[0, 4]|
+------+------+
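(If you prefer being explicit, the dot syntax should be equivalent to Column.getField; a one-line sketch:)

// same projection, spelled out with getField
recs.select($"userId", $"recommendations".getField("itemId")).show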
So maybe the Spark ML API isn't that difficult after all :)
With an array as the type of a column, e.g. recommendations, you'd be quite productive using the explode function (or the more advanced flatMap operator; there's a sketch of that at the end of this answer).

explode(e: Column): Column
"Creates a new row for each element in the given array or map column."

That gives you bare structs to work with.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.explode

val structType = new StructType().
  add($"itemId".int).
  add($"rating".float)
val arrayType = ArrayType(structType)
val recs = Seq((1, Array((1, .7), (2, .5))), (2, Array((0, .9), (4, .1)))).
  toDF("userId", "recommendations").
  select($"userId", $"recommendations" cast arrayType)
val exploded = recs.withColumn("recs", explode($"recommendations"))
scala> exploded.show
+------+------------------+-------+
|userId|   recommendations|   recs|
+------+------------------+-------+
|     1|[[1,0.7], [2,0.5]]|[1,0.7]|
|     1|[[1,0.7], [2,0.5]]|[2,0.5]|
|     2|[[0,0.9], [4,0.1]]|[0,0.9]|
|     2|[[0,0.9], [4,0.1]]|[4,0.1]|
+------+------------------+-------+
Structs are nice in the select operator with * (star), which flattens them into one column per struct field. You could do select($"element.*").
scala> exploded.select("userId", "recs.*").show
+------+------+------+
|userId|itemId|rating|
+------+------+------+
|     1|     1|   0.7|
|     1|     2|   0.5|
|     2|     0|   0.9|
|     2|     4|   0.1|
+------+------+------+
I think that could do what you're after.
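Here's also a sketch of the flatMap alternative mentioned above (my assumptions: the same recs DataFrame, and spark.implicits._ in scope for the encoders):

// case classes mirroring the schema of recs
case class Rec(itemId: Int, rating: Float)
case class UserRecs(userId: Int, recommendations: Seq[Rec])

val flattened = recs.as[UserRecs].flatMap { ur =>
  ur.recommendations.map(r => (ur.userId, r.itemId, r.rating))
}.toDF("userId", "itemId", "rating")

It gives the same three columns as the star-expansion above, at the cost of deserializing every row into JVM objects. Which brings me to...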
p.s. Stay away from UDFs as long as possible, since they "trigger" a conversion of rows from the internal format (InternalRow) to JVM objects, which can lead to excessive GC.
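(For completeness: the UDF in the question failed because elements of an array of structs reach a Scala UDF as Row objects, not tuples. If you really had to go that route, something like this sketch should work:)

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// each struct element arrives as a Row; field 0 is itemId in this schema
val itemIds = udf((rs: Seq[Row]) => rs.map(_.getInt(0)))
recs.withColumn("items", itemIds($"recommendations"))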