
How to count consecutive word pairs in a string in Python

I'm trying to write a Python script that takes a string and counts how often each pair of consecutive words occurs. Let's say:

string = " i have no idea how to write this script. i have an idea."

output = 
['i', 'have'] 2
['have', 'no'] 1
['no', 'idea'] 1
['idea', 'how'] 1
['how', 'to'] 1
['to', 'write'] 1
...

I want to do this in plain Python, without importing collections or using Counter from collections. What I have so far is below. I was trying to use re.findall(#whatpatterndoiuse, string) to iterate through the string and compare words, but I can't figure out how.

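import re

string = " i have no idea how to write this script. i have an idea."
punctuation = re.compile(r'[^\w\s]')  # assumed definition; not shown in the original snippet

word_list = string.lower().split()
freq_dic = {}  # maps each word to its count
for word in word_list:
    word = punctuation.sub("", word)
    freq_dic[word] = freq_dic.get(word, 0) + 1

# note: this counts single words only, not consecutive pairs
for word, freq in sorted(freq_dic.items()):
    print(word, freq)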

I also tried Counter from collections, which I would rather not use. It also produces output in a format different from the one shown above.

import re
from collections import Counter

with open('a.txt') as f:
    words = re.findall(r'\w+', f.read())
print(Counter(zip(words, words[1:])))
Valerio Zhang asked Nov 15 '15 18:11



2 Answers

Solving this without zip is fairly simple. Just build tuples of each pair of words and track their count in a dict. There are just a few special cases to watch for - when the input string only has one word, and when you are at the end of the string.

Give this a shot:

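def freq(input_string):
    counts = {}  # maps (word, next_word) tuples to their counts
    words = input_string.split()
    # fewer than two words means there are no pairs to count
    if len(words) < 2:
        return counts

    for idx, word in enumerate(words):
        # the last word has no successor, so skip it
        if idx + 1 < len(words):
            word_pair = (word, words[idx + 1])
            counts[word_pair] = counts.get(word_pair, 0) + 1

    return counts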
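
Note that the function above splits on whitespace only, so 'script.' and 'idea.' keep their trailing periods. A minimal sketch of how the input might be normalized first and the result printed in the format from the question, still without any imports. The names normalize and pair_counts, and the choice to drop every character that is not a letter, digit or whitespace, are assumptions made for illustration:

```python
def normalize(text):
    # Hypothetical helper: lowercase the text and keep only letters,
    # digits and whitespace, so "script." becomes "script".
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())


def pair_counts(text):
    # Same idea as freq() above: count each (word, next_word) tuple in a dict.
    words = normalize(text).split()
    counts = {}
    # range(len(words) - 1) is empty for fewer than two words
    for idx in range(len(words) - 1):
        pair = (words[idx], words[idx + 1])
        counts[pair] = counts.get(pair, 0) + 1
    return counts


string = " i have no idea how to write this script. i have an idea."
for pair, count in pair_counts(string).items():
    print(list(pair), count)
```

Because punctuation is simply removed, pairs that span a sentence boundary (such as 'script', 'i') are counted too. If that is not wanted, the text would have to be split into sentences first.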
tknickman answered Oct 10 '22 19:10



You need to solve three problems:

  1. generate all pairs of words (['i', 'have'], ['have', 'no'], ...);
  2. count the occurrences of these pairs of words;
  3. sort the pairs from the most common to the least common.

The second problem can be easily solved by using a Counter. Counter objects also provide a most_common() method to solve the third problem.

The first problem can be solved in many ways. The most compact way is using zip:

>>> import re
>>> s = 'i have no idea how to write this script. i have an idea.'
>>> words = re.findall(r'\w+', s)
>>> pairs = zip(words, words[1:])
>>> list(pairs)
[('i', 'have'), ('have', 'no'), ('no', 'idea'), ...]

Putting everything together:

import collections
import re

def count_pairs(s):
    """
    Returns a mapping that links each pair of words
    to its number of occurrences.
    """
    words = re.findall(r'\w+', s.lower())
    pairs = zip(words, words[1:])
    return collections.Counter(pairs)

def print_freqs(s):
    """
    Prints the number of occurrences of word pairs
    from the most common to the least common.
    """
    cnt = count_pairs(s)
    for pair, count in cnt.most_common():
        print(list(pair), count)

EDIT: I realized just now that I accidentally read "with collections, counters, ..." instead of "without importing collections, ...". My bad, sorry.
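
Since the question asks for a solution without collections, the same three steps can also be done with a plain dict and sorted(). This is only a sketch under that assumption: the names count_pairs_no_collections and sorted_freqs are made up for illustration, and re is still used for tokenizing as in the code above.

```python
import re


def count_pairs_no_collections(s):
    # Hypothetical name. Steps 1 and 2: pair each word with its successor
    # and count the pairs in a plain dict instead of a Counter.
    words = re.findall(r'\w+', s.lower())
    counts = {}
    for pair in zip(words, words[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def sorted_freqs(s):
    # Step 3: order pairs from most to least common. sorted() is stable
    # and dicts keep insertion order (Python 3.7+), so pairs with equal
    # counts stay in order of first appearance.
    counts = count_pairs_no_collections(s)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


text = 'i have no idea how to write this script. i have an idea.'
for pair, count in sorted_freqs(text):
    print(list(pair), count)
```

The sort key is the count alone, so ties are not ordered alphabetically. A key such as (-count, pair) could be used instead if a fully deterministic alphabetical order is preferred.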

Andrea Corbellini answered Oct 10 '22 18:10
