unique word frequency in multiple files

I am new to Python. I have been given a folder with around 2000 text files. I am supposed to output each word and the number of times it occurs, without counting repetitions within a file. For example, the sentence "i am what i am" must contribute only one occurrence of "i" for that file.

I am able to do this for a single file, but how do I do it for multiple files?

from collections import Counter
import re

def openfile(filename):
    # Read the whole file and return its contents as a string
    fh = open(filename, "r")
    text = fh.read()
    fh.close()
    return text

def removegarbage(text):
    # Replace one or more non-word (non-alphanumeric) chars with a space
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text

def getwordbins(words):
    cnt = Counter()
    for word in words:
        cnt[word] += 1
    return cnt

def main(filename, topwords):
    txt = openfile(filename)
    txt = removegarbage(txt)
    words = txt.split()
    bins = getwordbins(words)
    for key, value in bins.most_common(topwords):
        print(key, value)

main('speech.txt', 500)
asked by user2464521

2 Answers

You can get a list of files by using the glob() or iglob() functions in the glob module. I also noticed that you weren't using the Counter object efficiently: rather than incrementing counts one word at a time, it is much better to call its update() method and pass it the whole list of words. Here's a streamlined version of your code that processes all the *.txt files found in the specified folder:

from collections import Counter
from glob import iglob
import re
import os

def remove_garbage(text):
    """Replace non-word (non-alphanumeric) chars in text with spaces,
       then convert and return a lowercase version of the result.
    """
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text

topwords = 100
folderpath = 'path/to/directory'
counter = Counter()
for filepath in iglob(os.path.join(folderpath, '*.txt')):
    with open(filepath) as file:
        counter.update(remove_garbage(file.read()).split())

for word, count in counter.most_common(topwords):
    print('{}: {}'.format(count, word))
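
Note that this counts every occurrence of a word across all the files. If, as your example suggests, each word should be counted at most once per file, you can deduplicate each file's words with a set before updating the counter:

    counter.update(set(remove_garbage(file.read()).split()))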
answered by martineau

If I got your explanation right, you want to calculate, for each word, the number of files containing that word. Here is what you could do.

For each file, obtain the set of words in that file (that is, each word should appear only once). Then, for each word, count the number of sets it can be found in.

Here is what I suggest (a complete sketch follows the steps):

  1. Loop over all the files in the target directory. You can use os.listdir for this purpose.
  2. Make a set of words found in this file:

    with open(filepath, 'r') as f:
        txt = removegarbage(f.read())
        words = set(txt.split())
    
  3. Now that you have a set of words for every file, you can finally use Counter with those sets. It's best to use its update() method. Here is a little demo:

    >>> a = set("hello Python world hello".split())
    >>> a
    {'Python', 'world', 'hello'}
    >>> b = set("foobar hello world".split())
    >>> b
    {'foobar', 'hello', 'world'}
    >>> c = Counter()
    >>> c.update(a)
    >>> c.update(b)
    >>> c
    Counter({'world': 2, 'hello': 2, 'Python': 1, 'foobar': 1})
    
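Putting these steps together, here is a minimal sketch. It reuses a cleanup helper like the removegarbage() from your question, and folderpath is a placeholder for your folder of text files:

from collections import Counter
import os
import re

def removegarbage(text):
    # Replace runs of non-word chars with a space, then lowercase
    return re.sub(r'\W+', ' ', text).lower()

folderpath = 'path/to/directory'  # placeholder: your folder of text files
counter = Counter()
for filename in os.listdir(folderpath):
    if not filename.endswith('.txt'):
        continue  # skip anything that is not a text file
    with open(os.path.join(folderpath, filename)) as f:
        # A set keeps each word at most once per file
        words = set(removegarbage(f.read()).split())
    counter.update(words)

# Words that appear in the most files come first
for word, count in counter.most_common(100):
    print('{}: {}'.format(word, count))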
answered by kirelagin