unique word frequency in multiple files

I am new to Python. I have been given a folder with around 2000 text files. I am supposed to output each word and the number of times it occurs, without counting repetitions within a file. For example, the sentence "i am what i am" must contribute only one occurrence of "i" for that file.

I am able to do this for a single file, but how do I do it for multiple files?

from collections import Counter
import re

def openfile(filename):
    # Read the whole file and return its contents as a string
    fh = open(filename, "r")
    text = fh.read()
    fh.close()
    return text

def removegarbage(text):
    # Replace one or more non-word (non-alphanumeric) chars with a space
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text

def getwordbins(words):
    cnt = Counter()
    for word in words:
        cnt[word] += 1
    return cnt

def main(filename, topwords):
    txt = openfile(filename)
    txt = removegarbage(txt)
    words = txt.split()
    bins = getwordbins(words)
    for key, value in bins.most_common(topwords):
        print(key, value)

main('speech.txt', 500)
asked by user2464521

2 Answers

You can get a list of files by using the glob() or iglob() functions in the glob module. I also noticed that you weren't using the Counter object efficiently: rather than incrementing counts one word at a time, it is much better to call its update() method and pass it the whole list of words. Here's a streamlined version of your code that processes all the *.txt files found in the specified folder:

from collections import Counter
from glob import iglob
import re
import os

def remove_garbage(text):
    """Replace non-word (non-alphanumeric) chars in text with spaces,
       then convert and return a lowercase version of the result.
    """
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text

topwords = 100
folderpath = 'path/to/directory'
counter = Counter()
for filepath in iglob(os.path.join(folderpath, '*.txt')):
    with open(filepath) as file:
        counter.update(remove_garbage(file.read()).split())

for word, count in counter.most_common(topwords):
    print('{}: {}'.format(count, word))
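
Note that this counts every occurrence of a word across all the files. If, as your example suggests, each word should be counted at most once per file, you can deduplicate each file's words with a set before updating the counter:

    counter.update(set(remove_garbage(file.read()).split()))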
answered by martineau

If I got your explanation right, you want to calculate, for each word, the number of files containing that word. Here is what you could do.

For each file, obtain the set of words in that file (that is, each word should appear only once). Then, for each word, count the number of sets it can be found in.

Here is what I suggest (a complete sketch follows the steps):

  1. Loop over all the files in the target directory. You can use os.listdir for this purpose.
  2. Make a set of words found in this file:

    with open(filepath, 'r') as f:
        txt = removegarbage(f.read())
        words = set(txt.split())
    
  3. Now that you have a set of words for every file, you can finally use Counter with those sets. It's best to use its update() method. Here is a little demo:

    >>> a = set("hello Python world hello".split())
    >>> a
    {'Python', 'world', 'hello'}
    >>> b = set("foobar hello world".split())
    >>> b
    {'foobar', 'hello', 'world'}
    >>> c = Counter()
    >>> c.update(a)
    >>> c.update(b)
    >>> c
    Counter({'world': 2, 'hello': 2, 'Python': 1, 'foobar': 1})
    
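Putting these steps together, here is a minimal sketch. It reuses a cleanup helper like the removegarbage() from your question, and folderpath is a placeholder for your folder of text files:

from collections import Counter
import os
import re

def removegarbage(text):
    # Replace runs of non-word chars with a space, then lowercase
    return re.sub(r'\W+', ' ', text).lower()

folderpath = 'path/to/directory'  # placeholder: your folder of text files
counter = Counter()
for filename in os.listdir(folderpath):
    if not filename.endswith('.txt'):
        continue  # skip anything that is not a text file
    with open(os.path.join(folderpath, filename)) as f:
        # A set keeps each word at most once per file
        words = set(removegarbage(f.read()).split())
    counter.update(words)

# Words that appear in the most files come first
for word, count in counter.most_common(100):
    print('{}: {}'.format(word, count))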
answered by kirelagin