fastest way to compare text file content

I have a question that should help me streamline my program. I have a file, text.txt, and I want to go through it and compare its contents against a list of words, words. Each time one of the words is found, 1 should be added to a running count.

words = ['the', 'or', 'and', 'can', 'help', 'it', 'one', 'two']
ints = []
with open('text.txt') as file:
    for line in file:
        for part in line.split():
            for word in words:
                if word in part:
                    ints.append(1)

Is there a faster way to do this? The text files could be rather large, and the list of words will be much larger.

user1985351 asked Jan 08 '23 05:01


2 Answers

You can convert words to a set so that lookups are faster. This should give your program a good performance boost: checking membership in a list has to traverse the list one element at a time (O(n) time), whereas a set uses hashing to find its elements, so a membership test takes O(1) time on average. Note that a set lookup matches whole tokens, whereas the word in part check in the question also matches substrings.

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
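To get a rough feel for the difference, here is a minimal timing sketch. The 10,000-entry vocabulary and the token "missing" are made up for illustration, and the absolute numbers will vary by machine; only the relative gap is of interest.

```python
import timeit

# Hypothetical vocabulary of 10,000 made-up words, used only for timing.
vocabulary = [f"word{i}" for i in range(10_000)]
as_list = vocabulary
as_set = set(vocabulary)

# Worst case for the list: the token is absent, so every element is compared.
token = "missing"

list_seconds = timeit.timeit(lambda: token in as_list, number=1_000)
set_seconds = timeit.timeit(lambda: token in as_set, number=1_000)

print(f"list: {list_seconds:.4f}s  set: {set_seconds:.4f}s")
```

The list time should grow with the size of the vocabulary, while the set time should stay roughly flat.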

Then you can use the sum function to count the matches, like this

with open('text.txt') as file:
    print(sum(part in words for line in file for part in line.split()))
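One thing worth checking before switching: the loop in the question tests word in part, which is a substring test, while set membership compares whole tokens. The sketch below runs both on a small made-up sample (an io.StringIO stand-in for text.txt, which is an assumption for illustration) so the difference is visible.

```python
import io

words = ['the', 'or', 'and', 'can', 'help', 'it', 'one', 'two']
word_set = set(words)

# Stand-in for text.txt so the example runs without a file on disk.
sample = "the doctor can help\nit works for one and all\n"

# Approach from the question: substring test against every word in the list.
substring_hits = sum(
    1
    for line in io.StringIO(sample)
    for part in line.split()
    for word in words
    if word in part
)

# Set approach: one whole-token membership test per token.
token_hits = sum(
    part in word_set
    for line in io.StringIO(sample)
    for part in line.split()
)

print(substring_hits, token_hits)  # prints: 9 6
```

The three extra substring hits come from "or" appearing inside "doctor", "works" and "for". Which behaviour is wanted depends on the use case; the set version is the faster one but only counts exact tokens.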

Boolean values and their integer equivalents

In Python, bool is a subclass of int, so False and True compare equal to 0 and 1 respectively.

>>> True == 1
True
>>> False == 0
True
>>> int(True)
1
>>> int(False)
0
>>> sum([True, True, True])
3
>>> sum([True, False, True])
2

So whenever you check if part in words, the result will be either 0 or 1 and we sum all those values.


The code above is functionally equivalent to

result = 0
with open('text.txt') as file:
    for line in file:
        for part in line.split():
            if part in words:
                result += 1

Note: If you really want a list of integers instead of a total, you can turn the generator expression passed to sum into a list comprehension. The version below produces one entry per token: 1 where the token matches and 0 where it does not

with open('text.txt') as file:
    print([int(part in words) for line in file for part in line.split()])

Frequency of words

If you actually want the frequency of each individual word in words, you can use collections.Counter like this

from collections import Counter
with open('text.txt') as file:
    c = Counter(part for line in file for part in line.split() if part in words)

This counts the number of times each of the words in words occurs in the file.
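As a small illustration of what the resulting Counter holds, here is a sketch that uses a made-up two-line sample in place of text.txt (the sample text is an assumption, not part of the question).

```python
import io
from collections import Counter

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

# Stand-in for text.txt so the example runs without a file on disk.
sample = "the cat and the dog\none or two can help the one\n"

c = Counter(
    part
    for line in io.StringIO(sample)
    for part in line.split()
    if part in words
)

print(c['the'])           # 3
print(c['it'])            # 0, missing keys count as zero
print(c.most_common(2))   # [('the', 3), ('one', 2)]
print(sum(c.values()))    # 10, the same total the sum version would give
```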


As per the comment, you can use a dictionary that maps positive words to a positive score and negative words to a negative score, and sum the scores like this

words = {'happy': 1, 'good': 1, 'great': 1, 'no': -1, 'hate': -1}
with open('text.txt') as file:
    print(sum(words.get(part, 0) for line in file for part in line.split()))

Here, we use the dictionary's get method to fetch the score stored against each word. If the word is not in the dictionary (it is neither a good word nor a bad word), get returns the default value 0.
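A minimal sketch of the scoring idea on in-memory text follows. The sample sentences and the per-line breakdown are assumptions added for illustration; the dictionary is the one from the answer, renamed scores here.

```python
import io

scores = {'happy': 1, 'good': 1, 'great': 1, 'no': -1, 'hate': -1}

# Stand-in for text.txt so the example runs without a file on disk.
sample = "good food and great service\nno parking and I hate the noise\n"

# Score each line separately; unknown tokens contribute the default 0.
line_scores = [
    sum(scores.get(part, 0) for part in line.split())
    for line in io.StringIO(sample)
]
total = sum(line_scores)

print(line_scores)  # [2, -2]
print(total)        # 0
```

Because the lookup is on whole tokens, "noise" does not trigger the score for "no".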

thefourtheye answered Jan 14 '23 14:01


You can use set.intersection to find the intersection between a set and a list. So, as a more efficient approach, put your words in a set and do:

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
ints = []
with open('text.txt') as f:
    for line in f:
        for _ in range(len(words.intersection(line.split()))):
            ints.append(1)

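Note that the preceding solution mirrors your code, which appends 1 to a list. Also note that the intersection counts each word at most once per line, even if it appears several times on that line. If you only want the final count, you can use a generator expression within sum: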

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
with open('text.txt') as f:
    print(sum(len(words.intersection(line.split())) for line in f))
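To see how the intersection count relates to a per-occurrence count, here is a small sketch on a made-up line with repeated words (the sample line is an assumption for illustration).

```python
import io

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

# One line in which 'the' appears three times and 'and' twice.
sample = "the cat and the dog and the bird\n"

# Intersection: each matching word is counted once per line.
distinct_per_line = sum(
    len(words.intersection(line.split()))
    for line in io.StringIO(sample)
)

# Membership test per token: every occurrence is counted.
occurrences = sum(
    part in words
    for line in io.StringIO(sample)
    for part in line.split()
)

print(distinct_per_line, occurrences)  # prints: 2 5
```

So the intersection approach fits when you want to know which words a line contains, and the per-token test fits when every occurrence should add 1.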

Mazdak answered Jan 14 '23 12:01