Check if two strings contain the same set of words in Python

Question

I am trying to compare two sentences and see if they contain the same set of words.
Eg: comparing "today is a good day" and "is today a good day" should return true
I am using the Counter function from collections module right now

from collections import Counter


vocab = {}
for line in file_ob:
    flag = 0
    for sentence in vocab:
        if Counter(sentence.split(" ")) == Counter(line.split(" ")):
            vocab[sentence]+=1
            flag = 1
            break
        if flag==0:
            vocab[line]=1

It seems to work fine for a few lines, but my text file has more than 1000 and it never finishes executing. Is there any other way, something more efficient that would help me compute the result for the entire file?

EDIT:

I just need a replacement for the Counter method, something to replace it. And not any change in the implementation.

Eric Duminil · Accepted Answer

You really don't need to use two loops.

Correct way to use dicts

Let's say you have a dict:

my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 5, 'g': 6}

Your code is basically equivalent to:

for (key, value) in my_dict.items():
    if key == 'c':
        print(value)
        break
#=> 3

But the whole point of dict (and set, Counter, ...) is to be able to get the desired value directly:

my_dict['c']
#=> 3

If your dict has 1000 values, the first example will be 500 times slower than the second, on average. Here's a simple description I've found on Reddit:

A dict is like a magic coat check room. You hand your coat over and get a ticket. Whenever you give that ticket back, you immediately get your coat. You can have a lot of coats, but you still get your coat back immediately. There is a lot of magic going on inside the coat check room, but you don't really care as long as you get your coat back immediately.

Refactored code

You just need to find a common signature between "Today is a good day!" and "Is today a good day?". One way would be to extract the words, convert them to lowercase, sort them and join them. What's important is that the output should be immutable (e.g. tuple, string, frozenset). This way, it can be used inside sets, Counters or dicts directly, without needing to iterate over every key.

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

vocab = Counter()
for sentence in sentences:
    sorted_words = ' '.join(sorted(sentence.lower().split(" ")))
    vocab[sorted_words] += 1

vocab
#=> # Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

or even shorter:

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = Counter(sorted_words(sentence) for sentence in sentences)
# Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

This code should be much faster than what you've tried until now.

Yet another alternative

If you want to keep the original sentences in a list, you can use setdefault :

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = {}
for sentence in sentences:
    vocab.setdefault(sorted_words(sentence), []).append(sentence)

vocab

#=> {'a day good is today': ['Today is a good day', 'Is today a good day'],
# 'a b c': ['a b c', 'c b a'],
# 'a a b c': ['a a b c']}

DataCruncher · Answer

Try something like

set(sentence.split(" ")) == set(line.split(" "))

Comparing set objects is faster than comparing counter. Both set and counter objects are basically sets, however when you use counter object for comparison, it has to compare both the keys and values whereas the set only has to compare keys.
Thank you Eric and Barmar for your inputs.

Your full code will look like

from collections import Counter
vocab = {a dictionary of around 1000 sentences as keys}
for line in file_ob:
    for sentence in vocab:
        if set(sentence.split(" ")) == set(line.split(" ")):
            vocab[sentence]+=1

Neal Fultz · Answer

In your code, you can extract the Counter construction outside of the inner loop, instead of recalculating each for every pair - this should improve the algorithm by a factor proportional to the avg # of tokens per string.

from collections import Counter
vocab = {a dictionary of around 1000 sentences as keys}

vocab_counter = {k: Counter(k.split(" ")) for k in vocab.keys() }

for line in file_obj:
    line_counter = Counter(line.split(" "))
    for sentence in vocab:
        if vocab_counter[sentence] == line_counter:
            vocab[sentence]+=1

Further improvements could be had by using the Counters as indices to a dictionary, which would let you replace the linear search for matching sentences with a lookup. The frozendict package would probably be useful so that you can use a dictionary as a key to another dictionary.

Check if two strings contain the same set of words in Python

Tags:

python

text

python-2.7

text-extraction

TheLastCoder

3 Answers

Correct way to use dicts

Refactored code

Yet another alternative

Eric Duminil

DataCruncher

Neal Fultz

Recent Activity

Donate For Us

Check if two strings contain the same set of words in Python

Tags:

python

text

python-2.7

text-extraction

TheLastCoder

3 Answers

Correct way to use dicts

Refactored code

Yet another alternative

Eric Duminil

DataCruncher

Neal Fultz

Related questions

Recent Activity

Donate For Us