I am new to Python. I am given a folder with around 2000 text files. I am supposed to output each word and the number of times it occurs, without repetition within a file. For example, in the sentence "i am what i am", the word "i" must be counted only once for that file.
I am able to do this for a single file, but how do I do it for multiple files?
from collections import Counter
import re

def openfile(filename):
    fh = open(filename, "r")
    text = fh.read()
    fh.close()
    return text

def removegarbage(text):
    # Replace one or more non-word (non-alphanumeric) chars with a space
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text

def getwordbins(words):
    cnt = Counter()
    for word in words:
        cnt[word] += 1
    return cnt

def main(filename, topwords):
    txt = openfile(filename)
    txt = removegarbage(txt)
    words = txt.split(' ')
    bins = getwordbins(words)
    for key, value in bins.most_common(topwords):
        print(key, value)

main('speech.txt', 500)
You can get a list of files by using the glob() or iglob() function in the glob module. I noted that you weren't using the Counter object efficiently: it would be much better to just call its update() method and pass it the list of words. Here's a streamlined version of your code that processes all the *.txt files found in the specified folder:
from collections import Counter
from glob import iglob
import re
import os

def remove_garbage(text):
    """Replace non-word (non-alphanumeric) chars in text with spaces,
    then convert and return a lowercase version of the result.
    """
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text

topwords = 100
folderpath = 'path/to/directory'
counter = Counter()
for filepath in iglob(os.path.join(folderpath, '*.txt')):
    with open(filepath) as file:
        counter.update(remove_garbage(file.read()).split())

for word, count in counter.most_common(topwords):
    print('{}: {}'.format(count, word))
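One caveat: passing the raw word list to update() counts every occurrence of a word, while the question asks to count each word at most once per file. A minimal sketch of that tweak (the folder path is a placeholder, as above) wraps the split result in a set before updating the counter:

```python
from collections import Counter
from glob import iglob
import os
import re

topwords = 100
folderpath = 'path/to/directory'  # placeholder; point this at your folder

counter = Counter()
for filepath in iglob(os.path.join(folderpath, '*.txt')):
    with open(filepath) as file:
        words = re.sub(r'\W+', ' ', file.read()).lower().split()
        # set() keeps each word at most once per file,
        # so the counter tracks how many files contain the word
        counter.update(set(words))

for word, count in counter.most_common(topwords):
    print('{}: {}'.format(word, count))
```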
If I got your explanation right, you want to calculate, for each word, the number of files containing that word. Here is what you could do.
For each file, obtain the set of words in that file (that is, words should be unique). Then, for each word, count the number of sets it can be found in.
Here is what I suggest:
Iterate over the files in the folder; you can use os.listdir for this purpose. Make a set of the words found in each file:
with open(filepath, 'r') as f:
    txt = removegarbage(f.read())
    words = set(txt.split())
Now that you have a set of words for every file, you can finally use Counter with those sets. It's best to use its update method. Here is a little demo:
>>> a = set("hello Python world hello".split())
>>> a
{'Python', 'world', 'hello'}
>>> b = set("foobar hello world".split())
>>> b
{'foobar', 'hello', 'world'}
>>> c = Counter()
>>> c.update(a)
>>> c.update(b)
>>> c
Counter({'world': 2, 'hello': 2, 'Python': 1, 'foobar': 1})
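Putting those pieces together, here is a runnable sketch of the whole approach (the folder path is an assumption, and the isdir guard is only there so the sketch runs unchanged even if the folder doesn't exist):

```python
import os
import re
from collections import Counter

def removegarbage(text):
    """Replace non-word characters with spaces and lowercase the result."""
    return re.sub(r'\W+', ' ', text).lower()

folderpath = 'path/to/directory'  # assumed location of the text files

counter = Counter()
if os.path.isdir(folderpath):  # guard so the sketch is safe to run as-is
    for filename in os.listdir(folderpath):
        if not filename.endswith('.txt'):
            continue
        with open(os.path.join(folderpath, filename)) as f:
            # one set per file, so each word counts at most once per file
            counter.update(set(removegarbage(f.read()).split()))

# each count is the number of files containing that word
for word, count in counter.most_common():
    print('{}: {}'.format(word, count))
```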