I am using Spark with Scala to calculate the cosine similarity between DataFrame rows.
The DataFrame schema is below:
root
|-- SKU: double (nullable = true)
|-- Features: vector (nullable = true)
A sample of the DataFrame is below:
+-------+--------------------+
| SKU| Features|
+-------+--------------------+
| 9970.0|[4.7143,0.0,5.785...|
|19676.0|[5.5,0.0,6.4286,4...|
| 3296.0|[4.7143,1.4286,6....|
|13658.0|[6.2857,0.7143,4....|
| 1.0|[4.2308,0.7692,5....|
| 513.0|[3.0,0.0,4.9091,5...|
| 3753.0|[5.9231,0.0,4.846...|
|14967.0|[4.5833,0.8333,5....|
| 2803.0|[4.2308,0.0,4.846...|
|11879.0|[3.1429,0.0,4.5,4...|
+-------+--------------------+
I tried to transpose the matrix and checked the following links: Apache Spark Python Cosine Similarity over DataFrames, calculating-cosine-similarity-by-featurizing-the-text-into-vector-using-tf-idf. But I believe there is a better solution.
I tried the sample code below:
val irm = new IndexedRowMatrix(inClusters.rdd.map {
case (v,i:Vector) => IndexedRow(v, i)
}).toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities
But I got the error below:
Error:(80, 12) constructor cannot be instantiated to expected type;
found : (T1, T2)
required: org.apache.spark.sql.Row
case (v,i:Vector) => IndexedRow(v, i)
I checked the following link: Apache Spark: How to create a matrix from a DataFrame? But I can't do it using Scala.
You can use the built-in columnSimilarities() method on a RowMatrix, which can either compute the exact cosine similarities or estimate them using the DIMSUM method; the latter is considerably faster for larger datasets. The difference in usage is that for the latter you have to specify a threshold.
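For example, a minimal sketch of both calls (assuming an active SparkContext sc; the feature values here are made up for illustration):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Build a small RowMatrix from an RDD of mllib Vectors (illustrative data)
val rows = sc.parallelize(Seq(
  Vectors.dense(4.7143, 0.0, 5.785),
  Vectors.dense(5.5, 0.0, 6.4286),
  Vectors.dense(4.7143, 1.4286, 6.0)
))
val mat = new RowMatrix(rows)

// Exact cosine similarities between the columns of the matrix
val exactSims = mat.columnSimilarities()

// Approximate similarities via DIMSUM; pairs below the threshold may be dropped
val approxSims = mat.columnSimilarities(0.1)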
We use the following formula to compute the cosine similarity, where A and B are vectors:

cosine(A, B) = A.B / (||A|| * ||B||)

A.B is the dot product of A and B: it is computed as the sum of the element-wise products of A and B. ||A|| is the L2 norm of A: it is computed as the square root of the sum of the squares of the elements of A.
Cosine similarity measures the similarity between two vectors of an inner product space. It is the cosine of the angle between the two vectors and tells you whether they are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
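As a quick illustration of the formula above, here is a minimal sketch in plain Scala (cosineSimilarity is a hypothetical helper written for this answer, not part of Spark):

// Cosine similarity of two dense vectors: dot(A, B) / (||A|| * ||B||)
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

// Example: similarity of the leading components of the first two Feature vectors
cosineSimilarity(Array(4.7143, 0.0, 5.785), Array(5.5, 0.0, 6.4286))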
- DataFrame.rdd returns RDD[Row], not RDD[(T, U)]. You have to pattern match the Row or extract the interesting parts directly.
- The ml Vector used with Datasets since Spark 2.0 is not the same as the mllib Vector used by the old API. You have to convert it to use it with IndexedRowMatrix.
- The index has to be a Long, not a String.
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val irm = new IndexedRowMatrix(inClusters.rdd.map {
  // extract the ml Vector from each Row and convert it to an mllib Vector
  case Row(_, v: org.apache.spark.ml.linalg.Vector) =>
    org.apache.spark.mllib.linalg.Vectors.fromML(v)
}.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })
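From there you can get similarities between the original rows the same way your attempt did, by transposing and calling columnSimilarities (a sketch following the question's own pipeline; the 0.1 threshold is just an illustration):

// Exact cosine similarity between the original DataFrame rows
val rowSims = irm.toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities()

// Or the approximate DIMSUM variant with a threshold
val approxRowSims = irm.toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities(0.1)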