
Computing N Grams using Python

I needed to compute the unigrams, bigrams and trigrams for a text file containing text like:

"Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I started in Python and used the following code:

    #!/usr/bin/env python
    # File: n-gram.py

    def N_Gram(N, text):
        NList = []                       # start with an empty list
        if N > 1:
            space = " " * (N - 1)        # add N - 1 spaces
            text = space + text + space  # add both in front and back
        # append the slices [i:i+N] to NList
        for i in range(len(text) - (N - 1)):
            NList.append(text[i:i+N])
        return NList                     # return the list

    # test code
    for i in range(5):
        print(N_Gram(i + 1, "text"))

    # more test code
    nList = N_Gram(7, "Here is a lot of text to print")
    for ngram in iter(nList):
        print('"' + ngram + '"')

http://www.daniweb.com/software-development/python/threads/39109/generating-n-grams-from-a-word

But this produces character-level n-grams, sliding across the letters of the text, whereas I want word-level n-grams: for example CYSTIC and FIBROSIS as unigrams, or CYSTIC FIBROSIS as a bigram. Can someone help me with how to get this done?

gran_profaci asked Nov 16 '12 20:11



2 Answers

A short Pythonesque solution from this blog:

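    def find_ngrams(input_list, n):
        # Zip together n copies of the list, each shifted by one more position.
        # list() keeps the result a list on Python 3, where zip is lazy.
        return list(zip(*[input_list[i:] for i in range(n)]))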

Usage:

    >>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
    >>> find_ngrams(input_list, 1)
    [('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
    >>> find_ngrams(input_list, 2)
    [('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
    >>> find_ngrams(input_list, 3)
    [('all', 'this', 'happened'), ('this', 'happened', 'more'), ('happened', 'more', 'or'), ('more', 'or', 'less')]
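To apply this to the text from the question, the string first has to be split into words. Here is a minimal sketch, assuming plain whitespace splitting with `str.split()` is good enough (punctuation stays attached to the neighbouring word). The `text`, `words` and `bigrams` names are only illustrative and do not come from the blog post:

```python
def find_ngrams(input_list, n):
    # Same idea as above; list() makes the result concrete on Python 3.
    return list(zip(*[input_list[i:] for i in range(n)]))

text = "Cystic fibrosis affects 30,000 children and young adults"
words = text.split()  # whitespace tokenization (an assumption)

bigrams = find_ngrams(words, 2)
print(bigrams[:2])
# [('Cystic', 'fibrosis'), ('fibrosis', 'affects')]
```

Each n-gram comes back as a tuple of words, so `' '.join(gram)` turns it back into a string if that is more convenient.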
Franck Dernoncourt answered Sep 28 '22 12:09


Assuming the input is a string containing space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):

    def ngrams(input, n):
        input = input.split(' ')
        output = []
        for i in range(len(input)-n+1):
            output.append(input[i:i+n])
        return output

    ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]

If you want those joined back into strings, you might call something like:

[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d'] 

Lastly, that doesn't summarize things into totals, so if your input was 'a a a a', you need to count them up into a dict:

    text = 'a a a a'
    grams = {}  # maps each n-gram string to its count
    for g in (' '.join(x) for x in ngrams(text, 2)):
        grams.setdefault(g, 0)
        grams[g] += 1

Putting that all together into one final function gives:

    def ngrams(input, n):
        input = input.split(' ')
        output = {}
        for i in range(len(input)-n+1):
            g = ' '.join(input[i:i+n])
            output.setdefault(g, 0)
            output[g] += 1
        return output

    ngrams('a a a a', 2) # {'a a': 3}
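For real text like the sample in the question, splitting on a single space leaves punctuation and capitalization in the n-grams. As a rough sketch of one way around that, the variant below lowercases the text, pulls out words with a regular expression, and counts with `collections.Counter` instead of a hand-rolled dict. The `ngram_counts` name and the tokenization rule are assumptions made for illustration, not part of the answer above:

```python
import re
from collections import Counter

def ngram_counts(text, n):
    # Lowercase and keep runs of letters, digits and apostrophes as words
    # (a simplifying assumption; it splits "30,000" into "30" and "000").
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Join each window of n consecutive words back into one string.
    grams = (' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

counts = ngram_counts("Cystic fibrosis sufferers. Cystic fibrosis affects children", 2)
print(counts.most_common(1))
# [('cystic fibrosis', 2)]
```

`Counter` is a dict subclass, so `counts['cystic fibrosis']` works as before, and `most_common()` gives the n-grams sorted by frequency.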
dave mankoff answered Sep 28 '22 13:09