I have a question to help streamline my programming.
I have a file, text.txt, and I want to look through it and compare it with a list of words, words. Each time a word is found, it adds 1 to an integer.
words = ['the', 'or', 'and', 'can', 'help', 'it', 'one', 'two']
ints = []
with open('text.txt') as file:
    for line in file:
        for part in line.split():
            for word in words:
                if word in part:
                    ints.append(1)
I was just wondering if there is a faster way to do this. The text files could be rather large, and the list of words will be much larger.
You can convert words to a set so that the lookups will be faster. This should give your program a good performance boost: looking up a value in a list has to traverse the list one element at a time (O(n) runtime complexity), whereas looking up a value in a set takes O(1) time on average (constant time), because sets use hashes to find their elements.
words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
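To get a rough feel for the difference, here is a minimal timing sketch. The 10,000-word vocabulary and the missing word are made up for illustration, and the absolute numbers will vary by machine; only the relative gap matters.

```python
import timeit

# Hypothetical vocabulary of 10,000 made-up words, used only for timing.
word_list = [f"word{i}" for i in range(10_000)]
word_set = set(word_list)

# Worst case for the list: the word is absent, so every element is compared.
missing = "not-in-vocabulary"

# Run each membership test 1,000 times and record the total seconds.
list_seconds = timeit.timeit(lambda: missing in word_list, number=1_000)
set_seconds = timeit.timeit(lambda: missing in word_set, number=1_000)

print(f"list: {list_seconds:.6f}s  set: {set_seconds:.6f}s")
```

The larger the word list gets, the wider this gap should become, since the list scan grows with the number of words while the set lookup does not.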
And then, whenever there is a match, you can use the sum function to count it, like this:
with open('text.txt') as file:
    print(sum(part in words for line in file for part in line.split()))
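One thing worth checking before switching: part in words against a set tests whether the whole token is one of the words, while the loop in the question (word in part) tests whether a word occurs anywhere inside the token. The sketch below compares the two on a made-up sample; io.StringIO stands in for the real file and the sample text is an assumption.

```python
import io

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

# io.StringIO stands in for open('text.txt') so the sketch is self-contained.
sample = io.StringIO("the cat and the hat\nit can help someone\n")

tokens = [part for line in sample for part in line.split()]

# Exact matching: the whole token must be one of the words.
exact = sum(part in words for part in tokens)

# Substring matching, as in the question's loop: 'one' is found inside 'someone'.
substring = sum(word in part for part in tokens for word in words)

print(exact, substring)  # 6 7
```

If substring matches are what you actually need, the set lookup is not a drop-in replacement; if whole words are what you want, the set version is both faster and more accurate.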
Boolean values and their integer equivalents
In Python, the result of a boolean expression is equal to either 0 or 1, for False and True respectively.
>>> True == 1
True
>>> False == 0
True
>>> int(True)
1
>>> int(False)
0
>>> sum([True, True, True])
3
>>> sum([True, False, True])
2
So whenever you check part in words, the result will be either False or True, which count as 0 or 1, and sum adds up all those values.
The code above is functionally equivalent to:
result = 0
with open('text.txt') as file:
    for line in file:
        for part in line.split():
            if part in words:
                result += 1
Note: In case you really wanted a list of 1's, then you can simply convert the generator expression passed to sum into a list comprehension. This gives a 1 for each part that matches and a 0 for each part that does not:
with open('text.txt') as file:
    print([int(part in words) for line in file for part in line.split()])
Frequency of words
If you actually wanted to find the frequency of the individual words in words, then you can use collections.Counter, like this:
from collections import Counter
with open('text.txt') as file:
    c = Counter(part for line in file for part in line.split() if part in words)
This will internally count the number of times each of the words in words occurs in the file.
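As a small illustration of what the resulting Counter gives you, here is a sketch on made-up sample text (io.StringIO stands in for the file; the sample is an assumption):

```python
import io
from collections import Counter

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

# io.StringIO stands in for open('text.txt') so the sketch is self-contained.
sample = io.StringIO("the cat and the hat\nit can help the one\n")

c = Counter(part for line in sample for part in line.split() if part in words)

print(c['the'])          # 3
print(c['two'])          # 0, because a Counter returns 0 for keys it has not seen
print(c.most_common(1))  # [('the', 3)]
print(sum(c.values()))   # 8, the total number of matches
```

So the same Counter answers both questions: per-word frequencies through indexing or most_common, and the overall count through sum(c.values()).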
As per the comment, you can have a dictionary that stores positive words with a positive score and negative words with a negative score, and sum the scores like this:
words = {'happy': 1, 'good': 1, 'great': 1, 'no': -1, 'hate': -1}
with open('text.txt') as file:
    print(sum(words.get(part, 0) for line in file for part in line.split()))
Here, we use the dictionary's words.get method to get the value stored against the word. If the word is not found in the dictionary (it is neither a good word nor a bad word), it returns the default value 0.
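A quick way to sanity-check the scoring is to run it on a couple of made-up lines. In this sketch, io.StringIO stands in for the file and the sample sentences are assumptions:

```python
import io

words = {'happy': 1, 'good': 1, 'great': 1, 'no': -1, 'hate': -1}

# io.StringIO stands in for open('text.txt') so the sketch is self-contained.
sample = io.StringIO("what a great and good day\nI hate rain\n")

# Unknown words contribute 0 through the default argument of dict.get.
score = sum(words.get(part, 0) for line in sample for part in line.split())

print(score)  # 1, from great (+1) + good (+1) + hate (-1)
```

Note that the lookup is case-sensitive and sees punctuation as part of the token, so "Hate" or "hate," would score 0 unless the text is normalized first.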
You can use set.intersection to find the intersection between a set and a list. So, as a more efficient approach, put your words in a set and do:
words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
ints = []
with open('text.txt') as f:
    for line in f:
        for _ in range(len(words.intersection(line.split()))):
            ints.append(1)
Note that the preceding solution is based on your code, which adds 1 to a list. If you want to find the final count, you can use a generator expression within sum:
words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}
with open('text.txt') as f:
    # Print the total; a bare sum(...) expression would discard the result.
    print(sum(len(words.intersection(line.split())) for line in f))
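One caveat with the intersection approach: a set intersection keeps each word at most once, so a word that appears several times on the same line is counted only once for that line. The sketch below, on made-up sample text, compares it with a per-occurrence count:

```python
import io

words = {'the', 'or', 'and', 'can', 'help', 'it', 'one', 'two'}

# Made-up sample; 'the' appears twice on the first line.
text = "the cat and the hat\nthe end\n"

# Distinct matching words per line, summed over lines.
distinct_per_line = sum(
    len(words.intersection(line.split())) for line in io.StringIO(text)
)

# Every occurrence of a matching word.
occurrences = sum(
    part in words for line in io.StringIO(text) for part in line.split()
)

print(distinct_per_line, occurrences)  # 3 4
```

Which number is the right one depends on whether repeated words on a line should count once or every time.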