Counting bigrams from user input in python 3?

Question

I'm stuck and need a little guidance. I'm trying hard to learn Python on my own using Grok Learning. Below is the Problem and example output along with where I am in the code. I appreciate any tips that will help me solve this problem.

In linguistics, a bigram is a pair of adjacent words in a sentence. The sentence "The big red ball." has three bigrams: The big, big red, and red ball.

Write a program to read in multiple lines of input from the user, where each line is a space-separated sentence of words. Your program should then count up how many times each of the bigrams occur across all input sentences. The bigrams should be treated in a case insensitive manner by converting the input lines to lowercase. Once the user stops entering input, your program should print out each of the bigrams that appear more than once, along with their corresponding frequencies. For example:
Line: The big red ball
Line: The big red ball is near the big red box
Line: I am near the box
Line: 
near the: 2
red ball: 2
the big: 3
big red: 3

I haven't gotten very far with my code and am really stuck. But here is where I am:

words = set()
line = input("Line: ")
while line != '':
  words.add(line)
  line = input("Line: ")

Am I even doing this right? Try not to import any modules and just use built-in functionality.

Thanks, Jeff

randomir · Accepted Answer

Let's start with a function that receives a sentence (with punctuation) and returns a list of all lowercase bigrams found.

So, we first need to replace every non-word character (anything other than letters, digits, and underscores) with a space, convert all letters to lowercase, and then split the sentence on whitespace into a list of words:

import re

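def bigrams(sentence):
    # Replace non-word characters with spaces (raw string avoids an invalid-escape warning)
    text = re.sub(r'\W', ' ', sentence.lower())
    words = text.split()
    # zip is lazy in Python 3; wrap it in list() so the pairs can be displayed
    return list(zip(words, words[1:]))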

We'll use the standard-library re module for regular-expression-based substitution of non-word characters with spaces, and the builtin zip function to pair up consecutive words. (We pair the list of words with the same list, but shifted by one element.)
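To make the shifted pairing concrete, here is a minimal sketch on a fixed word list (the sample words are only for illustration):

```python
words = ["the", "big", "red", "ball"]

# words[1:] drops the first word, so zip lines each word up with its successor.
# zip stops at the shorter input, so the last word never gets an empty partner.
pairs = list(zip(words, words[1:]))
print(pairs)  # [('the', 'big'), ('big', 'red'), ('red', 'ball')]

# A one-word sentence simply yields no bigrams.
single = list(zip(["ball"], ["ball"][1:]))
print(single)  # []
```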

Now we can test it:

>>> bigrams("The big red ball")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
>>> bigrams("THE big, red, ball.")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
>>> bigrams(" THE  big,red,ball!!?")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]

Next, for counting bigrams found in each sentence, you might use collections.Counter.

For example, like this:

from collections import Counter

counts = Counter()
for line in ["The big red ball", "The big red ball is near the big red box", "I am near the box"]:
    counts.update(bigrams(line))

We get:

>>> counts
Counter({('the', 'big'): 3, ('big', 'red'): 3, ('red', 'ball'): 2, ('near', 'the'): 2, ('ball', 'is'): 1, ('is', 'near'): 1, ('red', 'box'): 1, ('i', 'am'): 1, ('am', 'near'): 1, ('the', 'box'): 1})

Now we just need to print those that appear more than once:

for bigr, cnt in counts.items():
    if cnt > 1:
        print("{0[0]} {0[1]}: {1}".format(bigr, cnt))

All put together, with a loop for user input, instead of the fixed list:

import re
from collections import Counter

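def bigrams(sentence):
    # Replace non-word characters with spaces (raw string avoids an invalid-escape warning)
    text = re.sub(r'\W', ' ', sentence.lower())
    words = text.split()
    # Pair each word with the one that follows it
    return list(zip(words, words[1:]))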

counts = Counter()
while True:
    line = input("Line: ")
    if not line:
        break
    counts.update(bigrams(line))

for bigr, cnt in counts.items():
    if cnt > 1:
        print("{0[0]} {0[1]}: {1}".format(bigr, cnt))

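The output (on Python 3.7+, where dictionaries keep insertion order, bigrams print in order of first appearance; older versions may order them differently):

Line: The big red ball
Line: The big red ball is near the big red box
Line: I am near the box
Line: 
the big: 3
big red: 3
red ball: 2
near the: 2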
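
The question asked to avoid imports. As a rough sketch under that constraint, the same idea can be written with builtins only: str.isalnum stands in for the regular expression and a plain dict for Counter. The names bigrams_no_re and count_bigrams are made up for illustration, and the fixed sample list stands in for the input loop. Note one difference: unlike \W, this treats underscores as separators.

```python
def bigrams_no_re(sentence):
    # Hypothetical helper: keep letters and digits, turn everything else into spaces.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in sentence.lower())
    words = cleaned.split()
    # Pair each word with the one that follows it.
    return list(zip(words, words[1:]))


def count_bigrams(lines):
    # Hypothetical helper: tally bigrams per line, so pairs never span two sentences.
    counts = {}
    for line in lines:
        for pair in bigrams_no_re(line):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


sample = [
    "The big red ball",
    "The big red ball is near the big red box",
    "I am near the box",
]
counts = count_bigrams(sample)

# Print only the bigrams seen more than once.
for (first, second), n in counts.items():
    if n > 1:
        print("{} {}: {}".format(first, second, n))
```

With the three sample lines this reports the same four repeated bigrams as the output above. To read from the user instead, collect the lines with the same while loop and pass them to count_bigrams.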

inspectorG4dget · Answer

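counts = {}
while True:
    line = input("Line: ").strip().lower()
    if not line:
        break
    words = line.split()
    # Pair each word with its successor, within this line only,
    # so every adjacent pair is counted and none spans two sentences
    for t in zip(words, words[1:]):
        counts[t] = counts.get(t, 0) + 1

# Report only the bigrams that occur more than once
for (first, second), n in counts.items():
    if n > 1:
        print("{} {}: {}".format(first, second, n))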

Tags:

python

python-3.x

Jeff Singleton
