Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Check if two strings contain the same set of words in Python

I am trying to compare two sentences and see if they contain the same set of words.
Eg: comparing "today is a good day" and "is today a good day" should return true
I am using the Counter function from collections module right now

from collections import Counter


vocab = {}
for line in file_ob:
    flag = 0
    for sentence in vocab:
        if Counter(sentence.split(" ")) == Counter(line.split(" ")):
            vocab[sentence]+=1
            flag = 1
            break
        if flag==0:
            vocab[line]=1

It seems to work fine for a few lines, but my text file has more than 1000 and it never finishes executing. Is there any other way, something more efficient that would help me compute the result for the entire file?

EDIT:

I just need a replacement for the Counter method, something to replace it. And not any change in the implementation.

like image 222
TheLastCoder Avatar asked Jun 25 '19 20:06

TheLastCoder


3 Answers

You really don't need to use two loops.

Correct way to use dicts

Let's say you have a dict:

my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 5, 'g': 6}

Your code is basically equivalent to:

for (key, value) in my_dict.items():
    if key == 'c':
        print(value)
        break
#=> 3

But the whole point of dict (and set, Counter, ...) is to be able to get the desired value directly:

my_dict['c']
#=> 3

If your dict has 1000 values, the first example will be 500 times slower than the second, on average. Here's a simple description I've found on Reddit:

A dict is like a magic coat check room. You hand your coat over and get a ticket. Whenever you give that ticket back, you immediately get your coat. You can have a lot of coats, but you still get your coat back immediately. There is a lot of magic going on inside the coat check room, but you don't really care as long as you get your coat back immediately.

Refactored code

You just need to find a common signature between "Today is a good day!" and "Is today a good day?". One way would be to extract the words, convert them to lowercase, sort them and join them. What's important is that the output should be immutable (e.g. tuple, string, frozenset). This way, it can be used inside sets, Counters or dicts directly, without needing to iterate over every key.

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

vocab = Counter()
for sentence in sentences:
    sorted_words = ' '.join(sorted(sentence.lower().split(" ")))
    vocab[sorted_words] += 1

vocab
#=> # Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

or even shorter:

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = Counter(sorted_words(sentence) for sentence in sentences)
# Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

This code should be much faster than what you've tried until now.

Yet another alternative

If you want to keep the original sentences in a list, you can use setdefault :

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = {}
for sentence in sentences:
    vocab.setdefault(sorted_words(sentence), []).append(sentence)

vocab

#=> {'a day good is today': ['Today is a good day', 'Is today a good day'],
# 'a b c': ['a b c', 'c b a'],
# 'a a b c': ['a a b c']}
like image 96
Eric Duminil Avatar answered Oct 21 '22 21:10

Eric Duminil


Try something like

set(sentence.split(" ")) == set(line.split(" "))

Comparing set objects is faster than comparing counter. Both set and counter objects are basically sets, however when you use counter object for comparison, it has to compare both the keys and values whereas the set only has to compare keys.
Thank you Eric and Barmar for your inputs.

Your full code will look like

from collections import Counter
vocab = {a dictionary of around 1000 sentences as keys}
for line in file_ob:
    for sentence in vocab:
        if set(sentence.split(" ")) == set(line.split(" ")):
            vocab[sentence]+=1

like image 25
DataCruncher Avatar answered Oct 21 '22 20:10

DataCruncher


In your code, you can extract the Counter construction outside of the inner loop, instead of recalculating each for every pair - this should improve the algorithm by a factor proportional to the avg # of tokens per string.

from collections import Counter
vocab = {a dictionary of around 1000 sentences as keys}

vocab_counter = {k: Counter(k.split(" ")) for k in vocab.keys() }

for line in file_obj:
    line_counter = Counter(line.split(" "))
    for sentence in vocab:
        if vocab_counter[sentence] == line_counter:
            vocab[sentence]+=1

Further improvements could be had by using the Counters as indices to a dictionary, which would let you replace the linear search for matching sentences with a lookup. The frozendict package would probably be useful so that you can use a dictionary as a key to another dictionary.

like image 43
Neal Fultz Avatar answered Oct 21 '22 22:10

Neal Fultz