I have a list of lists such as the following:
[["This", "is","a", "test"], ["test","something", "here"], ["cat", "dog", "fish"]]
How would I get the two lists that have the most words in common? In this case, it would be the first and second lists, because they both have the word "test" in common.
I have tried solving this by finding the intersection of every combination of two lists and keeping track of the combination with the highest number of words in common. However, this method seems inefficient with, say, 100,000 lists: that would be (100,000 choose 2) combinations, I think. Is there a faster way to do this?
This is my code attempt:
from itertools import combinations

a = [["This", "is", "a", "test"], ["test", "something", "here"], ["cat", "dog", "fish"]]
c = combinations(a, 2)
max_common = 0
for pair in c:
    # Count the words the two lists share
    intersection = set(pair[0]).intersection(pair[1])
    if len(intersection) > max_common:
        max_common = len(intersection)
        mostsimilar = pair
print(mostsimilar)
The output of my program is what I expected; however, it is very slow on bigger test cases.
output:
(['This', 'is', 'a', 'test'], ['test', 'something', 'here'])
Based on my understanding of the problem, I think this should work:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KDTree, DistanceMetric

# Join each word list into one string so CountVectorizer can tokenise it
data = np.array([' '.join(words) for words in [['This', 'is', 'a', 'test'], ['test', 'something', 'here'], ['cat', 'dog', 'fish']]])

# Binary bag-of-words matrix: one row per sentence, 1 if the word occurs in it
vectorised = CountVectorizer(binary=True).fit_transform(data).todense()

# Manhattan distance between binary rows counts the words present in only one of the two sentences
metric = DistanceMetric.get_metric('manhattan')
kdtree = KDTree(vectorised, metric=metric)

# k=2 because each point's nearest neighbour is itself; keep column 1, the nearest *other* sentence
distances, indices = [result[:, 1] for result in kdtree.query(vectorised, k=2, dualtree=True)]

# Rows whose nearest-neighbour distance equals the global minimum form the most similar pair
nearest_distances = (distances == distances.min())
print(data[nearest_distances])
Output:
['This is a test' 'test something here']
I recast the problem in the following manner:
Each list of words (or sentence) can be represented as a row in a sparse matrix, where a 1 in a particular column denotes the presence of a word and a 0 its absence, using sklearn's CountVectorizer.
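As a small illustration of that representation, here is a minimal sketch (note that CountVectorizer lowercases tokens by default and its default tokeniser drops single-character words such as 'a'):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
sentences = ['This is a test', 'test something here', 'cat dog fish']
vec = CountVectorizer(binary=True)
matrix = vec.fit_transform(sentences)
# Column order follows the sorted vocabulary; use get_feature_names() on older scikit-learn versions
print(vec.get_feature_names_out())
print(matrix.toarray())   # one row per sentence, 1 = word present, 0 = absent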
Then, we can see that the similarity of two sentences, as rows in the sparse matrix, can be determined by comparing the values of their elements in each column, which is simply the Manhattan distance. This means that we have a nearest neighbour problem.
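To make the distance concrete, a rough sketch with hand-written binary rows (the column order here is purely illustrative): for binary rows, the Manhattan distance counts the words that appear in exactly one of the two sentences, i.e. len(a) + len(b) - 2 * common, so the smallest distance corresponds to the most shared words when the sentences are of comparable length.
import numpy as np
# Hypothetical binary rows for 'This is a test' and 'test something here'
row_a = np.array([0, 0, 0, 0, 1, 0, 1, 1])   # is, test, this
row_b = np.array([0, 0, 0, 1, 0, 1, 1, 0])   # here, something, test
manhattan = np.abs(row_a - row_b).sum()   # words in exactly one of the two sentences
shared = np.minimum(row_a, row_b).sum()   # words in both sentences
print(manhattan, shared)   # 4 1, and indeed 4 == 3 + 3 - 2 * 1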
sklearn also provides a k-dimensional tree class, which we can use to find the nearest two neighbours for each point in the dataset (since a point's nearest neighbour is itself). Then, it remains to find the neighbours with the lowest distances, which we can use to index the original array.
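The query step in isolation looks roughly like this (a simplified sketch of the same idea; it passes the metric name as a string instead of a DistanceMetric object and skips the dual-tree option):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KDTree
sentences = ['This is a test', 'test something here', 'cat dog fish']
X = CountVectorizer(binary=True).fit_transform(sentences).toarray()
tree = KDTree(X, metric='manhattan')
distances, indices = tree.query(X, k=2)   # k=2 because column 0 is the point itself
print(indices[:, 0])     # each point's nearest neighbour is itself
print(distances[:, 1])   # distance to the closest *other* sentence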
Using %%timeit, I tested the runtime of my solution vs. blhsing's solution on the text on this page, leaving imports outside the timing loop:
# my solution
198 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# blhsing's solution
4.76 s ± 374 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Restricting the length of sentences to those under 20 words:
# my solution
3.2 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# blhsing's solution
6.08 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)