Apache Spark Python Cosine Similarity over DataFrames

For a Recommender System, I need to compute the cosine similarity between all the columns of a whole Spark DataFrame.

In Pandas I used to do this:

import sklearn.metrics as metrics
import pandas as pd

df = pd.DataFrame(...some dataframe over here :D ...)
metrics.pairwise.cosine_similarity(df.T, df.T)

That generates the similarity matrix between the columns (since I used the transpose).

Is there any way to do the same thing in Spark (Python)?

(I need to apply this to a matrix with tens of millions of rows and thousands of columns, which is why I need to do it in Spark.)

Asked May 11 '17 by Valerio Storch

People also ask

Is Apache Spark faster than pandas?

Due to parallel execution across all cores of multiple machines, PySpark runs operations faster than Pandas, which is why it is often necessary to convert a Pandas DataFrame to a PySpark (Spark with Python) DataFrame for better performance. This is one of the major differences between Pandas and PySpark DataFrames.
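For example, a minimal sketch of the conversion in both directions (the SparkSession setup and column names here are just for illustration):

import pandas as pd
from pyspark.sql import SparkSession

# Assumes a local Spark setup; app name and data are made up
spark = SparkSession.builder.master("local[*]").appName("convert-demo").getOrCreate()

pdf = pd.DataFrame({"user": [1, 2], "score": [0.5, 0.9]})

sdf = spark.createDataFrame(pdf)   # Pandas -> Spark: distributes the data
pdf_back = sdf.toPandas()          # Spark -> Pandas: collects to the driver (small results only)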

How do you find cosine similarity in Python?

We use the following formula to compute the cosine similarity, where A and B are vectors:

cos(A, B) = A·B / (||A|| ||B||)

A·B is the dot product of A and B: it is computed as the sum of the element-wise products of A and B. ||A|| is the L2 norm of A: it is computed as the square root of the sum of the squares of the elements of A.
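As a quick illustration, here is that formula in plain NumPy (the function name is my own):

import numpy as np

def cosine_similarity(a, b):
    # A.B / (||A|| * ||B||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1, 2, 3]), np.array([4, 5, 6])))  # ~0.9746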

Can pandas be used in Spark?

The pandas API on Spark is useful not only for pandas users but also for PySpark users, because it supports many tasks that are difficult to do with PySpark, for example plotting data directly from a PySpark DataFrame.
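For instance, a small sketch of the pandas API on Spark (shipped as pyspark.pandas since Spark 3.2; the data here is made up):

import pyspark.pandas as ps

psdf = ps.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
print(psdf.describe())  # pandas-style summary, computed by Spark under the hood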


1 Answer

You can use the built-in columnSimilarities() method on a RowMatrix, which can either compute the exact cosine similarities or estimate them using the DIMSUM method, the latter being considerably faster for large datasets. The difference in usage is that for the estimated version you have to specify a similarity threshold.

Here's a small reproducible example:

from pyspark.mllib.linalg.distributed import RowMatrix

# 'sc' is the SparkContext; each tuple becomes one row of the matrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])

# Convert to a distributed RowMatrix
mat = RowMatrix(rows)

# Calculate exact and DIMSUM-estimated column similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)

# Output: only the upper-triangular entries of the similarity matrix are returned
exact.entries.collect()
[MatrixEntry(0, 2, 0.991935352214),
 MatrixEntry(1, 2, 0.998441152599),
 MatrixEntry(0, 1, 0.997463284056)]
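Since the question starts from a Spark DataFrame rather than an RDD, one possible bridge (not part of the original answer, and assuming a numeric-only DataFrame named df) is to map each row to a dense list of floats before building the RowMatrix:

from pyspark.mllib.linalg.distributed import RowMatrix

# 'df' is a hypothetical numeric-only Spark DataFrame
rows = df.rdd.map(lambda row: [float(x) for x in row])
mat = RowMatrix(rows)
sims = mat.columnSimilarities()  # CoordinateMatrix of column cosine similarities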
Answered Sep 19 '22 by mtoto