
Creating a dictionary for each word in a file and counting the frequency of words that follow it

I am trying to solve a difficult problem and am getting lost.

Here's what I'm supposed to do:

INPUT: file
OUTPUT: dictionary

Return a dictionary whose keys are all the words in the file (broken by
whitespace). The value for each word is a dictionary containing each word
that can follow the key and a count for the number of times it follows it.

You should lowercase everything.
Use strip and string.punctuation to strip the punctuation from the words.

Example:
>>> #example.txt is a file containing: "The cat chased the dog."
>>> with open('../data/example.txt') as f:
...     word_counts(f)
{'the': {'dog': 1, 'cat': 1}, 'chased': {'the': 1}, 'cat': {'chased': 1}}

Here's all I have so far, in trying to at least pull out the correct words:

def word_counts(f):
    i = 0
    orgwordlist = f.split()
    for word in orgwordlist:
        if i<len(orgwordlist)-1:
            print orgwordlist[i]
            print orgwordlist[i+1]

with open('../data/example.txt') as f:
    word_counts(f)

I'm thinking I need to somehow use the .count method and eventually zip some dictionaries together, but I'm not sure how to count the second words for each first word.

I know I'm nowhere near solving the problem, but trying to take it one step at a time. Any help is appreciated, even just tips pointing in the right direction.

asked Jun 23 '17 20:06 by Kristie


1 Answer

We can solve this in two passes:

  1. in a first pass, we construct a Counter and count the tuples of two consecutive words using zip(..); and
  2. then we turn that Counter into a dictionary of dictionaries.

This results in the following code:

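import string
from collections import Counter, defaultdict

def word_counts(f):
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    st = [w.strip(string.punctuation) for w in f.read().lower().split()]
    # First pass: count each pair of consecutive words.
    ctr = Counter(zip(st, st[1:]))
    # Second pass: regroup the pair counts as {word: {next_word: count}}.
    dc = defaultdict(dict)
    for (k1, k2), v in ctr.items():
        dc[k1][k2] = v
    return dict(dc)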
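
As a possible variation on the approach above, the same result can also be built in a single pass by keeping one Counter per word. This is only a sketch: the name word_counts_single_pass is made up for illustration, Python 3 is assumed, and io.StringIO stands in for the opened example.txt so the snippet runs without a file on disk.

```python
import io
import string
from collections import Counter, defaultdict

def word_counts_single_pass(f):
    """Map each word to a dict of the words that follow it and their counts."""
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    words = [w.strip(string.punctuation) for w in f.read().lower().split()]
    followers = defaultdict(Counter)
    # Pair every word with its successor; the last word has no successor.
    for current, following in zip(words, words[1:]):
        followers[current][following] += 1
    # Convert to plain dicts so the result has the requested shape.
    return {word: dict(counts) for word, counts in followers.items()}

# io.StringIO is a stand-in for the file object from open('../data/example.txt').
result = word_counts_single_pass(io.StringIO("The cat chased the dog."))
print(result)
```

For the example sentence this yields the same dictionary as in the question: 'the' maps to {'cat': 1, 'dog': 1}, 'cat' to {'chased': 1}, and 'chased' to {'the': 1}. The word 'dog' does not appear as a key because nothing follows it.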

answered Sep 30 '22 12:09 by Willem Van Onsem