How to get the consecutive word count of a string in Python

Tags:

python

word-count

I'm trying to write a Python script that takes a string and counts each pair of consecutive words. For example:

string = " i have no idea how to write this script. i have an idea."

output = 
['i', 'have'] 2
['have', 'no'] 1
['no', 'idea'] 1
['idea', 'how'] 1
['how', 'to'] 1
['to', 'write'] 1
...

I'm trying to do this in Python without importing collections (in particular, without Counter). What I have so far is below. I'm trying to use re.findall(#whatpatterndoiuse, string) to iterate through the string and compare the words, but I'm having difficulty figuring out how.

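import re

# Assumed definition: the original snippet did not show what punctuation was
punctuation = re.compile(r'[^\w\s]')

word_list = re.split(r'\s+', string.lower().strip())
freq_dic = {}  # maps each single word to its count
for word in word_list:
    word = punctuation.sub("", word)
    freq_dic[word] = freq_dic.get(word, 0) + 1

freq_list = sorted(freq_dic.items())
for word, freq in freq_list:
    print(word, freq)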

The version below uses Counter from collections, which I did not want. It also produces output in a format different from the one I stated above.

import re
from collections import Counter

# Read the whole file and split it into words
with open('a.txt') as f:
    words = re.findall(r'\w+', f.read())
print(Counter(zip(words, words[1:])))


asked Nov 15 '15 18:11

Valerio Zhang

2 Answers

Solving this without zip is fairly simple. Just build tuples of each pair of words and track their count in a dict. There are just a few special cases to watch for - when the input string only has one word, and when you are at the end of the string.

Give this a shot:

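def freq(input_string):
    counts = {}  # maps (word, next_word) tuples to their counts
    words = input_string.split()
    # Fewer than two words means there are no pairs to count
    if len(words) < 2:
        return counts

    for idx, word in enumerate(words):
        # Skip the last word: it has no successor to pair with
        if idx+1 < len(words):
            word_pair = (word, words[idx+1])
            if word_pair in counts:
                counts[word_pair] += 1
            else:
                counts[word_pair] = 1

    return counts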
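
Note that split() leaves case and punctuation untouched, so 'script.' and 'idea.' are counted exactly as they appear. Below is a minimal sketch of a variant that normalizes the words first. It assumes that lowercasing the text and treating each run of \w characters as a word is acceptable; the name pair_counts is hypothetical and not part of the answer above.

```python
import re

def pair_counts(text):
    """Count consecutive word pairs using a plain dict (no collections)."""
    # Assumption: a "word" is a run of letters, digits or underscores.
    words = re.findall(r"\w+", text.lower())
    counts = {}
    # Stop one short of the end so words[idx + 1] always exists.
    for idx in range(len(words) - 1):
        pair = (words[idx], words[idx + 1])
        counts[pair] = counts.get(pair, 0) + 1
    return counts

counts = pair_counts(" i have no idea how to write this script. i have an idea.")
print(counts[("i", "have")])  # prints 2
```

Using range(len(words) - 1) handles the empty and one-word cases without a separate check, since the loop body simply never runs.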


answered Oct 10 '22 19:10

tknickman

You need to solve three problems:

1. generate all pairs of words (['i', 'have'], ['have', 'no'], ...);
2. count the occurrences of these pairs of words;
3. sort the pairs from the most common to the least common.

The second problem can be easily solved by using a Counter. Counter objects also provide a most_common() method to solve the third problem.

The first problem can be solved in many ways. The most compact way is using zip:

>>> import re
>>> s = 'i have no idea how to write this script. i have an idea.'
>>> words = re.findall(r'\w+', s)
>>> pairs = zip(words, words[1:])
>>> list(pairs)
[('i', 'have'), ('have', 'no'), ('no', 'idea'), ...]

Putting everything together:

import collections
import re

def count_pairs(s):
    """
    Returns a mapping that links each pair of words
    to its number of occurrences.
    """
    words = re.findall(r'\w+', s.lower())
    pairs = zip(words, words[1:])
    return collections.Counter(pairs)

def print_freqs(s):
    """
    Prints the number of occurrences of word pairs
    from the most common to the least common.
    """
    cnt = count_pairs(s)
    for pair, count in cnt.most_common():
        print(list(pair), count)

EDIT: I realized just now that I accidentally read "with collections, counters, ..." instead of "with out importing collections, ...". My bad, sorry.
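
Since the question asks for a solution without collections, here is a rough sketch of the same idea with a plain dict and sorted() standing in for Counter and most_common(). The names count_pairs_plain and format_freqs are made up for illustration. Pairs with equal counts stay in first-seen order, which relies on dicts preserving insertion order (Python 3.7+) and on sorted() being stable.

```python
import re

def count_pairs_plain(s):
    """Map each pair of consecutive words to its number of occurrences."""
    words = re.findall(r"\w+", s.lower())
    counts = {}
    for pair in zip(words, words[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def format_freqs(s):
    """Return one "['w1', 'w2'] n" line per pair, most common first."""
    counts = count_pairs_plain(s)
    # sorted() is stable, so pairs with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ["{} {}".format(list(pair), count) for pair, count in ranked]

for line in format_freqs("i have no idea how to write this script. i have an idea."):
    print(line)
```

The first line printed is ['i', 'have'] 2, which matches the format requested in the question.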

answered Oct 10 '22 18:10

Andrea Corbellini
