I have 250,000 lists containing an average of 100 strings each, stored across 10 dictionaries. I need to calculate the pairwise similarity of all lists (the exact similarity metric isn't relevant here, but briefly, it involves taking the intersection of the two lists and normalizing the result by some constant).
The code I've come up with for the pairwise comparisons is quite straightforward: I'm just using itertools.product to compare every list to every other list. The problem is performing these calculations on 250,000 lists in a time-efficient way. For anyone who's dealt with a similar problem: which of the usual options (scipy, PyTables) is best suited for this?
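Roughly, a minimal sketch of that brute-force approach looks like this (the toy data and the normalization constant are made up; itertools.combinations is used here so each unordered pair is visited once, where the actual code uses itertools.product):
>>> from itertools import combinations
>>> lists = {0: ["a", "b", "c"], 1: ["b", "c", "d"], 2: ["c", "d", "e"]}
>>> NORM = 100.0   # stand-in for whatever constant the real code normalizes by
>>> sims = {}
>>> for (i, li), (j, lj) in combinations(lists.items(), 2):
...     sims[(i, j)] = len(set(li) & set(lj)) / NORM
...
>>> sims[(0, 1)]
0.02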
By “pairwise”, we mean that we have to compute similarity for each pair of points. That means the computation will be O(M*N), where M is the size of the first set of points and N is the size of the second set; when a single set of N points is compared against itself, as in the question, this becomes O(N^2).
Similarity Matrix: A similarity matrix is the counterpart of a distance matrix. The elements of a similarity matrix measure the pairwise similarities of objects: the greater the similarity of two objects, the greater the value of the measure.
To create a similarity matrix we take two lines and compute a score for the difference between them. We do this for the selected set of lines and across the selected set of markers. The score for the difference between two lines is calculated by comparing each allele of line1 against each allele of line2.
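Purely as an illustration, assuming each line is a sequence of alleles (one per marker), a simple mismatch count could serve as the difference score; the data and the scoring rule here are made up:
>>> # hypothetical allele data: one allele per marker for each line
>>> line1 = ["A", "T", "G", "C"]
>>> line2 = ["A", "T", "C", "C"]
>>> # difference score: number of markers where the alleles disagree
>>> sum(a1 != a2 for a1, a2 in zip(line1, line2))
1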
We use the following formula to compute cosine similarity, where A and B are vectors: cos(A, B) = (A · B) / (||A|| ||B||). A · B is the dot product of A and B, computed as the sum of the element-wise products of A and B. ||A|| is the L2 norm of A, computed as the square root of the sum of the squares of the elements of A.
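For example, with NumPy (already a dependency here), the formula works out like this:
>>> import numpy as NP
>>> A = NP.array([1.0, 2.0, 3.0])
>>> B = NP.array([4.0, 5.0, 6.0])
>>> # dot product divided by the product of the two L2 norms
>>> cos_sim = float(NP.dot(A, B) / (NP.linalg.norm(A) * NP.linalg.norm(B)))
>>> round(cos_sim, 4)
0.9746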
Do you just want the most efficient way to determine the distance between any two points in your data?
Or do you actually need this m x m distance matrix that stores all pair-wise similarity values for all rows in your data?
Usually it's far more efficient to persist your data in some metric space, using a data structure optimized for rapid retrieval, than it is to pre-calculate the pair-wise similarity values in advance and just look them up. Needless to say, the distance-matrix option scales horribly: n data points require an n x n distance matrix to store the pair-wise similarity scores (for the 250,000 lists in the question, that is 6.25 x 10^10 entries, or roughly 500 GB at 8 bytes per value).
A kd-tree is the technique of choice for data of small dimension ("small" here means something like a number of features less than about 20); Voronoi tessellation is often preferred for higher-dimensional data.
Much more recently, the ball tree has been used as a superior alternative to both--it has the performance of the kd-tree but without the degradation at high dimension.
scikit-learn has an excellent implementation which includes unit tests. It is well-documented and currently under active development.
scikit-learn is built on NumPy and SciPy, so both are dependencies. The various installation options for scikit-learn are described on the project's website.
The most common use case for ball trees is in k-Nearest Neighbors, but they work quite well on their own, e.g., in cases like the one described in the OP.
you can use the scikit-learn Ball Tree implementation like so:
>>> # create some fake data: a 2D NumPy array having 10,000 rows and 10 columns
>>> import numpy as NP
>>> D = NP.random.randn(10000 * 10).reshape(10000, 10)
>>> # import the BallTree class (here bound to a local variable of same name)
>>> from sklearn.neighbors import BallTree as BallTree
>>> # call the constructor, passing in the data array and a 'leaf size'
>>> # the ball tree is instantiated and populated in the single step below:
>>> BT = BallTree(D, leaf_size=5, p=2)
>>> # 'leaf_size' specifies the number of points at which the tree
>>> # switches to brute-force search
>>> # 'p' specifies the distance metric; p=2 (the default) is Euclidean,
>>> # and p=1 is Manhattan (aka 'taxicab' or 'city block' distance)
>>> type(BT)
<type 'sklearn.neighbors.ball_tree.BallTree'>
instantiating & populating the ball tree is very fast (timed using Corey Goldberg's timer class):
>>> with Timer() as t:
...     BT = BallTree(D, leaf_size=5)
>>> "ball tree instantiated & populated in {0:2f} milliseconds".format(t.elapsed)
'ball tree instantiated & populated in 13.90 milliseconds'
querying the ball tree is also fast:
an example query: return the three data points closest to the data point at row index 500, and for each of them return its index and its distance from the reference point D[500,:]
>>> # the ball tree has an instance method, 'query', which returns the
>>> # distances and indices of the k nearest neighbors, one of each per neighbor
>>> # (note that 'query' expects a 2D array, hence the one-row slice below)
>>> dx, idx = BT.query(D[500:501, :], k=3)
>>> dx # distance
array([[ 0. , 1.206, 1.58 ]])
>>> idx # index
array([[500, 556, 373]], dtype=int32)
>>> with Timer() as t:
...     dx, idx = BT.query(D[500:501, :], k=3)
>>> "query results returned in {0:2f} milliseconds".format(t.elapsed)
'query results returned in 15.85 milliseconds'
The default distance metric in the scikit-learn Ball Tree implementation is Minkowski, which is just a generalization of Euclidean and Manhattan (i.e., in the Minkowski expression there is a parameter, p, which collapses to Euclidean when set to 2 and to Manhattan when set to 1).
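To see that collapse concretely, here is one pair of points evaluated under p=2 and p=1; SciPy's minkowski function is used purely for illustration:
>>> from scipy.spatial.distance import minkowski
>>> u, v = [0.0, 0.0], [3.0, 4.0]
>>> float(minkowski(u, v, p=2))   # Euclidean distance
5.0
>>> float(minkowski(u, v, p=1))   # Manhattan distance
7.0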