Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculate Cosine Similarity Spark Dataframe

I am using Spark Scala to calculate cosine similarity between the Dataframe rows.

Dataframe format is below

root
    |-- SKU: double (nullable = true)
    |-- Features: vector (nullable = true)

Sample of the dataframe below

    +-------+--------------------+
    |    SKU|            Features|
    +-------+--------------------+
    | 9970.0|[4.7143,0.0,5.785...|
    |19676.0|[5.5,0.0,6.4286,4...|
    | 3296.0|[4.7143,1.4286,6....|
    |13658.0|[6.2857,0.7143,4....|
    |    1.0|[4.2308,0.7692,5....|
    |  513.0|[3.0,0.0,4.9091,5...|
    | 3753.0|[5.9231,0.0,4.846...|
    |14967.0|[4.5833,0.8333,5....|
    | 2803.0|[4.2308,0.0,4.846...|
    |11879.0|[3.1429,0.0,4.5,4...|
    +-------+--------------------+

I tried to transpose the matrix and check the following mentioned links.Apache Spark Python Cosine Similarity over DataFrames, calculating-cosine-similarity-by-featurizing-the-text-into-vector-using-tf-idf But I believe there is a better solution

I am tried the below sample code

val irm = new IndexedRowMatrix(inClusters.rdd.map {
  case (v,i:Vector) => IndexedRow(v, i)


}).toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities

But I got the below error

Error:(80, 12) constructor cannot be instantiated to expected type;
 found   : (T1, T2)
 required: org.apache.spark.sql.Row
      case (v,i:Vector) => IndexedRow(v, i)

I checked the following Link Apache Spark: How to create a matrix from a DataFrame? But can't do it using Scala

like image 856
Moustafa Mahmoud Avatar asked Oct 30 '17 07:10

Moustafa Mahmoud


People also ask

How do you find the cosine similarity of spark?

You can use the built-in columnSimilarities() method on a RowMatrix , that can both calculate the exact cosine similarities, or estimate it using the DIMSUM method, which will be considerably faster for larger datasets. The difference in usage is that for the latter, you'll have to specify a threshold .

How do you find cosine similarity in Python?

We use the below formula to compute the cosine similarity. where A and B are vectors: A.B is dot product of A and B: It is computed as sum of element-wise product of A and B. ||A|| is L2 norm of A: It is computed as square root of the sum of squares of elements of the vector A.

What is cosine similarity in information retrieval?

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.


1 Answers

  • DataFrame.rdd returns RDD[Row] not RDD[(T, U)]. You have to pattern match the Row or directly extract interesting parts.
  • ml Vector used with Datasets since Spark 2.0 is not the same as mllib Vector use by old API. You have to convert it to use with IndexedRowMatrix.
  • Index has to be Long not string.
import org.apache.spark.sql.Row

val irm = new IndexedRowMatrix(inClusters.rdd.map {
  Row(_, v: org.apache.spark.ml.linalg.Vector) => 
    org.apache.spark.mllib.linalg.Vectors.fromML(v)
}.zipWithIndex.map { case (v, i) => IndexedRow(i, v) })
like image 59
Alper t. Turker Avatar answered Sep 23 '22 22:09

Alper t. Turker