
Computing N Grams using Python

Tags: python, nlp, nltk, n-gram

I needed to compute the unigrams, bigrams, and trigrams for a text file containing text like:

"Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I started in Python and used the following code:

```python
#!/usr/bin/env python
# File: n-gram.py
def N_Gram(N, text):
    NList = []                       # start with an empty list
    if N > 1:
        space = " " * (N - 1)        # add N - 1 spaces
        text = space + text + space  # add both in front and back
    # append the slices [i:i+N] to NList
    for i in range(len(text) - (N - 1)):
        NList.append(text[i:i+N])
    return NList                     # return the list

# test code
for i in range(5):
    print(N_Gram(i + 1, "text"))

# more test code
nList = N_Gram(7, "Here is a lot of text to print")
for ngram in iter(nList):
    print('"' + ngram + '"')
```

http://www.daniweb.com/software-development/python/threads/39109/generating-n-grams-from-a-word

But this produces character-level n-grams within a word, whereas I want word-level n-grams, such as CYSTIC and FIBROSIS (unigrams) or CYSTIC FIBROSIS (a bigram). Can someone help me work out how to get this done?


asked Nov 16 '12 20:11

gran_profaci

2 Answers

A short Pythonesque solution from this blog:

```python
def find_ngrams(input_list, n):
    # list() is needed on Python 3, where zip returns an iterator
    return list(zip(*[input_list[i:] for i in range(n)]))
```

Usage:

```python
>>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
>>> find_ngrams(input_list, 1)
[('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
>>> find_ngrams(input_list, 2)
[('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
>>> find_ngrams(input_list, 3)
[('all', 'this', 'happened'), ('this', 'happened', 'more'), ('happened', 'more', 'or'), ('more', 'or', 'less')]
```
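To apply this to a raw string such as the text in the question, the string first has to be split into words. Here is a minimal sketch, assuming Python 3 and plain whitespace tokenization. `word_ngrams` is a hypothetical helper name, not part of the blog's code:

```python
def find_ngrams(input_list, n):
    # Zip the list against copies of itself shifted by 0..n-1 positions.
    return list(zip(*[input_list[i:] for i in range(n)]))


def word_ngrams(text, n):
    # Hypothetical helper: split on any whitespace, then build word n-grams.
    return find_ngrams(text.split(), n)


sentence = "Cystic fibrosis affects 30,000 children"
print(word_ngrams(sentence, 1))  # unigrams
print(word_ngrams(sentence, 2))  # bigrams
print(word_ngrams(sentence, 3))  # trigrams
```

Note that punctuation stays attached to the words with this approach, so you may want to clean the text before splitting.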


answered Sep 28 '22 12:09

Franck Dernoncourt

Assuming the input is a string containing space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):

```python
def ngrams(input, n):
    input = input.split(' ')
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]
```

If you want those joined back into strings, you might call something like:

```python
[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']
```

Lastly, that doesn't tally repeated n-grams: if your input was 'a a a a', you would get 'a a' three times. To get totals, count them up in a dict:

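```python
input = 'a a a a'  # the text to analyze
grams = {}         # maps each n-gram string to its count
for g in (' '.join(x) for x in ngrams(input, 2)):
    grams.setdefault(g, 0)
    grams[g] += 1
```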
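As an aside, the same tally can be written with `collections.Counter` from the standard library, which avoids the `setdefault` bookkeeping. This is a sketch that assumes the list-returning `ngrams` function from above (repeated here in compact form so the snippet runs on its own). `count_ngrams` is a hypothetical name:

```python
from collections import Counter


def ngrams(input, n):
    # Same behavior as the list-returning function above.
    words = input.split(' ')
    return [words[i:i + n] for i in range(len(words) - n + 1)]


def count_ngrams(text, n):
    # Hypothetical helper: join each n-gram into a string and tally it.
    return Counter(' '.join(gram) for gram in ngrams(text, n))


counts = count_ngrams('a a a a', 2)
print(counts)                 # Counter({'a a': 3})
print(counts.most_common(1))  # [('a a', 3)]
```

A `Counter` is a dict subclass, so it can be used wherever the plain dict would be, and `most_common` gives the most frequent n-grams directly.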

Putting that all together into one final function gives:

```python
def ngrams(input, n):
    input = input.split(' ')
    output = {}
    for i in range(len(input)-n+1):
        g = ' '.join(input[i:i+n])
        output.setdefault(g, 0)
        output[g] += 1
    return output

ngrams('a a a a', 2) # {'a a': 3}
```
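One caveat with `split(' ')`: punctuation stays attached to words and case is preserved, so `sufferers,` and `sufferers`, or `Cystic` and `cystic`, would be counted as different words. Here is a rough sketch of one way to normalize first, assuming lowercase matching and a simple regular expression are acceptable for your text. The pattern and the name `ngram_counts` are illustrative assumptions, not a standard tokenizer:

```python
import re
from collections import Counter

# Assumed token pattern: runs of letters/digits, optionally joined by an
# internal apostrophe or comma (so "that's" and "30,000" stay whole).
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[',][a-z0-9]+)*")


def ngram_counts(text, n):
    # Lowercase first so "Cystic" and "cystic" are tallied together.
    words = TOKEN_RE.findall(text.lower())
    # Note: n-grams built this way run across sentence boundaries.
    grams = (' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


text = "Cystic fibrosis affects 30,000 children. That's cystic fibrosis."
for n in (1, 2, 3):
    print(n, ngram_counts(text, n).most_common(2))
```

If n-grams should not span sentences, one option is to split the text into sentences first and add up the per-sentence counters.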

answered Sep 28 '22 13:09

dave mankoff
