I am trying to compare two sentences and see if they contain the same set of words.
Eg: comparing "today is a good day" and "is today a good day" should return true
I am using the Counter function from collections module right now
from collections import Counter
vocab = {}
for line in file_ob:
flag = 0
for sentence in vocab:
if Counter(sentence.split(" ")) == Counter(line.split(" ")):
vocab[sentence]+=1
flag = 1
break
if flag==0:
vocab[line]=1
It seems to work fine for a few lines, but my text file has more than 1000 and it never finishes executing. Is there any other way, something more efficient that would help me compute the result for the entire file?
EDIT:
I just need a replacement for the Counter method, something to replace it. And not any change in the implementation.
You really don't need to use two loops.
Let's say you have a dict
:
my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 5, 'g': 6}
Your code is basically equivalent to:
for (key, value) in my_dict.items():
if key == 'c':
print(value)
break
#=> 3
But the whole point of dict
(and set
, Counter
, ...) is to be able to get the desired value directly:
my_dict['c']
#=> 3
If your dict has 1000 values, the first example will be 500 times slower than the second, on average. Here's a simple description I've found on Reddit:
A dict is like a magic coat check room. You hand your coat over and get a ticket. Whenever you give that ticket back, you immediately get your coat. You can have a lot of coats, but you still get your coat back immediately. There is a lot of magic going on inside the coat check room, but you don't really care as long as you get your coat back immediately.
You just need to find a common signature between "Today is a good day!"
and "Is today a good day?"
. One way would be to extract the words, convert them to lowercase, sort them and join them. What's important is that the output should be immutable (e.g. tuple
, string
, frozenset
). This way, it can be used inside sets, Counters or dicts directly, without needing to iterate over every key.
from collections import Counter
sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]
vocab = Counter()
for sentence in sentences:
sorted_words = ' '.join(sorted(sentence.lower().split(" ")))
vocab[sorted_words] += 1
vocab
#=> # Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})
or even shorter:
from collections import Counter
sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]
def sorted_words(sentence):
return ' '.join(sorted(sentence.lower().split(" ")))
vocab = Counter(sorted_words(sentence) for sentence in sentences)
# Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})
This code should be much faster than what you've tried until now.
If you want to keep the original sentences in a list, you can use setdefault
:
sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]
def sorted_words(sentence):
return ' '.join(sorted(sentence.lower().split(" ")))
vocab = {}
for sentence in sentences:
vocab.setdefault(sorted_words(sentence), []).append(sentence)
vocab
#=> {'a day good is today': ['Today is a good day', 'Is today a good day'],
# 'a b c': ['a b c', 'c b a'],
# 'a a b c': ['a a b c']}
Try something like
set(sentence.split(" ")) == set(line.split(" "))
Comparing set objects is faster than comparing counter. Both set and counter objects are basically sets, however when you use counter object for comparison, it has to compare both the keys and values whereas the set only has to compare keys.
Thank you Eric and Barmar for your inputs.
Your full code will look like
from collections import Counter
vocab = {a dictionary of around 1000 sentences as keys}
for line in file_ob:
for sentence in vocab:
if set(sentence.split(" ")) == set(line.split(" ")):
vocab[sentence]+=1
In your code, you can extract the Counter construction outside of the inner loop, instead of recalculating each for every pair - this should improve the algorithm by a factor proportional to the avg # of tokens per string.
from collections import Counter
vocab = {a dictionary of around 1000 sentences as keys}
vocab_counter = {k: Counter(k.split(" ")) for k in vocab.keys() }
for line in file_obj:
line_counter = Counter(line.split(" "))
for sentence in vocab:
if vocab_counter[sentence] == line_counter:
vocab[sentence]+=1
Further improvements could be had by using the Counters as indices to a dictionary, which would let you replace the linear search for matching sentences with a lookup. The frozendict
package would probably be useful so that you can use a dictionary as a key to another dictionary.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With