 

Most efficient way to calculate pairwise similarity of 250k lists

Tags: python, matrix

I have 250,000 lists containing an average of 100 strings each, stored across 10 dictionaries. I need to calculate the pairwise similarity of all lists (the exact similarity metric isn't relevant here; briefly, it involves taking the intersection of the two lists and normalizing the result by some constant).

The code I've come up with for the pairwise comparisons is straightforward: I just use itertools.product to compare every list to every other list (a minimal sketch of this brute-force approach follows the criteria list below). The problem is performing these calculations on 250,000 lists in a time-efficient way. To anyone who has dealt with a similar problem: which of the usual options (scipy, PyTables) is best in terms of the following criteria:

  • supports python data types
  • smartly stores a very sparse matrix (approx 80% of the values will be 0)
  • efficient (can do the calculations in under 10 hours)
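
A minimal sketch of the brute-force approach described above (the helper name, the toy data, and the exact normalization are illustrative assumptions, not the actual code):

from itertools import combinations

def similarity(a, b, norm=100.0):
    # intersection of the two lists, normalized by some constant
    return len(set(a) & set(b)) / norm

lists = {0: ['a', 'b', 'c'], 1: ['b', 'c', 'd'], 2: ['x', 'y']}

# combinations() visits each unordered pair once; itertools.product
# visits every ordered pair and so does roughly twice the work
scores = {(i, j): similarity(lists[i], lists[j])
          for i, j in combinations(lists, 2)}

With 250,000 lists this is on the order of 31 billion pairwise comparisons, which is why the brute-force route struggles to finish in under 10 hours.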
asked Jan 10 '13 by Renklauf




1 Answer

Do you just want the most efficient way to determine the distance between any two points in your data?

Or do you actually need this m x m distance matrix that stores all pair-wise similarity values for all rows in your data?

Usually it's far more efficient to persist your data in some metric space, using a data structure optimized for rapid retrieval, than it is to pre-calculate all pair-wise similarity values in advance and just look them up. Needless to say, the distance matrix option scales horribly: n data points require an n x n distance matrix to store the pair-wise similarity scores.
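
To put that scaling in perspective (back-of-the-envelope arithmetic, not from the original answer):

n = 250000                  # number of lists in the question
entries = n * n             # 62.5 billion pair-wise entries
dense_bytes = entries * 8   # float64: roughly 500 GB for a dense matrix
# storing only the upper triangle still leaves roughly 250 GB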

A kd-tree is the technique of choice for data of small dimension ("small" here means something like a number of features less than about 20); Voronoi tessellation is often preferred for higher-dimensional data.
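
For the kd-tree option, SciPy ships an implementation; a minimal sketch (the array shape is arbitrary, chosen only for illustration):

import numpy as np
from scipy.spatial import cKDTree

data = np.random.randn(10000, 10)      # 10,000 points, 10 features
tree = cKDTree(data)                    # build the kd-tree
dist, idx = tree.query(data[0], k=3)    # 3 nearest neighbours of the first point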

Much more recently, the ball tree has been used as a superior alternative to both--it has the performance of the kd-tree but without the degradation at high dimension.

scikit-learn has an excellent implementation which includes unit tests. It is well-documented and currently under active development.

scikit-learn is built on NumPy and SciPy, so both are dependencies. The various installation options for scikit-learn are provided on the project's site.

The most common use case for ball trees is k-Nearest Neighbors, but they work quite well on their own, e.g., in cases like the one described in the OP.

You can use the scikit-learn Ball Tree implementation like so:

>>> # create some fake data--a 2D NumPy array having 10,000 rows and 10 columns
>>> import numpy as NP
>>> D = NP.random.randn(10000 * 10).reshape(10000, 10)

>>> # import the BallTree class
>>> from sklearn.neighbors import BallTree

>>> # call the constructor, passing in the data array and a 'leaf size'
>>> # the ball tree is instantiated and populated in the single step below:

>>> BT = BallTree(D, leaf_size=5, p=2)

>>> # 'leaf_size' specifies the number of points at which the search
>>> # switches to brute force
>>> # 'p' specifies the distance metric: p=2 (the default) for Euclidean;
>>> # p=1 for Manhattan (aka 'taxicab' or 'city block') distance

>>> type(BT)
    <type 'sklearn.neighbors.ball_tree.BallTree'>

Instantiating & populating the ball tree is very fast (timed using Corey Goldberg's timer class):

>>> with Timer() as t:
        BT = BallTree(D, leaf_size=5)

>>> "ball tree instantiated & populated in {0:.2f} milliseconds".format(t.elapsed)
        'ball tree instantiated & populated in 13.90 milliseconds'
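
The Timer class itself isn't shown in the answer; a minimal context-manager equivalent (an assumption, written to match the usage above with an elapsed attribute in milliseconds) would be:

import time

class Timer(object):
    # minimal context-manager timer; 'elapsed' is reported in milliseconds
    def __enter__(self):
        self.start = time.time()
        return self
    def __exit__(self, *exc):
        self.elapsed = (time.time() - self.start) * 1000.0
        return False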

Querying the ball tree is also fast:

An example query: return the three data points closest to the data point at row index 500, and for each of them return its index and its distance from the reference point D[500,:].

>>> # the ball tree has an instance method, 'query', which returns the pair-wise
>>> # distance and an index; one distance and one index are returned per neighbor

>>> # query expects a 2D array, so the reference row is passed as a 1 x 10 slice
>>> dx, idx = BT.query(D[500:501, :], k=3)

>>> dx    # distance
    array([[ 0.   ,  1.206,  1.58 ]])

>>> idx    # index
    array([[500, 556, 373]], dtype=int32)

>>> with Timer() as t:
        dx, idx = BT.query(D[500:501, :], k=3)

>>> "query results returned in {0:.2f} milliseconds".format(t.elapsed)
        'query results returned in 15.85 milliseconds'

The default distance metric in the scikit-learn Ball Tree implementation is Minkowski, which is a generalization of both Euclidean and Manhattan: the Minkowski expression has a parameter p that gives Euclidean distance when set to 2 (the default) and Manhattan distance when set to 1.
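
For reference, the Minkowski distance can be written as a one-line Python function (the generic textbook definition, not scikit-learn's internal code):

def minkowski(x, y, p=2):
    # (sum_i |x_i - y_i|**p) ** (1/p); p=2 -> Euclidean, p=1 -> Manhattan
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)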

answered Nov 14 '22 by doug