I'm stuck and need a little guidance. I'm trying hard to learn Python on my own using Grok Learning. Below are the problem statement, the example output, and my code so far. I'd appreciate any tips that help me solve this problem.
In linguistics, a bigram is a pair of adjacent words in a sentence. The sentence "The big red ball." has three bigrams: The big, big red, and red ball.
Write a program to read in multiple lines of input from the user, where each line is a space-separated sentence of words. Your program should then count up how many times each of the bigrams occur across all input sentences. The bigrams should be treated in a case insensitive manner by converting the input lines to lowercase. Once the user stops entering input, your program should print out each of the bigrams that appear more than once, along with their corresponding frequencies. For example:
Line: The big red ball
Line: The big red ball is near the big red box
Line: I am near the box
Line:
near the: 2
red ball: 2
the big: 3
big red: 3
I haven't gotten very far with my code and am really stuck. But here is where I am:
words = set()
line = input("Line: ")
while line != '':
    words.add(line)
    line = input("Line: ")
Am I even doing this right? Try not to import any modules and just use built-in functionality.
Thanks, Jeff
Let's start with a function that receives a sentence (possibly with punctuation) and returns a list of all lowercase bigrams found in it.
So, we first convert all letters to lowercase, replace every non-word character (anything other than a letter, digit, or underscore) with a space, and then split the sentence on whitespace into a list of words:
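import re

def bigrams(sentence):
    # Lowercase, then replace every non-word character with a space
    text = re.sub(r'\W', ' ', sentence.lower())
    words = text.split()
    # Pair each word with its successor; list() so the result prints in Python 3
    return list(zip(words, words[1:]))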
We use the re module from the standard library (it is not a builtin, so it has to be imported) to replace non-word characters with spaces, and the builtin zip function to pair up consecutive words. (We pair the list of words with the same list, shifted by one element.)
Now we can test it:
>>> bigrams("The big red ball")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
>>> bigrams("THE big, red, ball.")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
>>> bigrams(" THE big,red,ball!!?")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
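To see why the shifted zip produces bigrams, it may help to look at the intermediate values. The following is only an illustrative sketch of the two steps described above; the variable names sentence, shifted and pairs are made up for the example.

```python
import re

sentence = " THE big,red,ball!!?"

# Lowercase, then replace every non-word character with a space.
# Note: \W keeps letters, digits and underscores.
text = re.sub(r'\W', ' ', sentence.lower())
words = text.split()            # ['the', 'big', 'red', 'ball']

# words[1:] is the same list shifted left by one element, so zip
# pairs each word with its successor and stops when the shorter
# list runs out.
shifted = words[1:]             # ['big', 'red', 'ball']
pairs = list(zip(words, shifted))
print(pairs)
```

A sentence with fewer than two words yields an empty list, because the shifted list is then empty.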
Next, for counting bigrams found in each sentence, you might use collections.Counter.
For example, like this:
from collections import Counter

counts = Counter()
for line in ["The big red ball", "The big red ball is near the big red box", "I am near the box"]:
    counts.update(bigrams(line))
We get:
>>> counts
Counter({('the', 'big'): 3, ('big', 'red'): 3, ('red', 'ball'): 2, ('near', 'the'): 2, ('red', 'box'): 1, ('i', 'am'): 1, ('the', 'box'): 1, ('ball', 'is'): 1, ('am', 'near'): 1, ('is', 'near'): 1})
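The reason one Counter can be reused across all lines is that Counter.update adds to the existing counts instead of replacing them. A minimal sketch with hand-written pairs (the tuples here are just sample data):

```python
from collections import Counter

counts = Counter()
counts.update([("the", "big"), ("big", "red")])
counts.update([("the", "big")])        # second call adds to the first

print(counts[("the", "big")])          # 2
print(counts[("red", "box")])          # 0, missing keys count as zero
```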
Now we just need to print those that appear more than once:
for bigr, cnt in counts.items():
    if cnt > 1:
        print("{0[0]} {0[1]}: {1}".format(bigr, cnt))
All put together, with a loop for user input, instead of the fixed list:
import re
from collections import Counter

def bigrams(sentence):
    text = re.sub(r'\W', ' ', sentence.lower())
    words = text.split()
    return list(zip(words, words[1:]))

counts = Counter()
while True:
    line = input("Line: ")
    if not line:
        break
    counts.update(bigrams(line))

for bigr, cnt in counts.items():
    if cnt > 1:
        print("{0[0]} {0[1]}: {1}".format(bigr, cnt))
The output (the order of the result lines may differ depending on the Python version):
Line: The big red ball
Line: The big red ball is near the big red box
Line: I am near the box
Line:
near the: 2
red ball: 2
big red: 3
the big: 3
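# Alternative without imports: count bigrams with a plain dict
counts = {}
while True:
    line = input("Line: ").strip().lower()
    if not line:
        break
    words = line.split()
    # Pair each word with its successor, within this line only
    for t in zip(words, words[1:]):
        counts[t] = counts.get(t, 0) + 1

# Print the bigrams that were seen more than once
for (first, second), n in counts.items():
    if n > 1:
        print("{} {}: {}".format(first, second, n))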
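Dictionary-based counting with overlapping pairs (zip(words, words[1:])), applied to one line at a time, can also be wrapped in a function so it is easy to check against the example from the question without typing input. This is only a sketch: count_bigrams is a hypothetical name, nothing is imported, and it assumes plain space-separated words as the problem statement describes, so punctuation is not stripped.

```python
def count_bigrams(lines):
    """Count lowercase bigrams over an iterable of sentences."""
    counts = {}
    for line in lines:
        words = line.lower().split()
        # Overlapping pairs: each word with the word that follows it.
        # Splitting per line avoids pairing the last word of one
        # sentence with the first word of the next.
        for pair in zip(words, words[1:]):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


example = [
    "The big red ball",
    "The big red ball is near the big red box",
    "I am near the box",
]
counts = count_bigrams(example)

# Keep only the bigrams that occur more than once
repeated = {" ".join(pair): n for pair, n in counts.items() if n > 1}
for text, n in repeated.items():
    print("{}: {}".format(text, n))
```

For the interactive version, the same function could be fed the lines collected from input() until an empty line is entered.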