I needed to compute the unigrams, bigrams and trigrams for a text file containing text like:
"Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."
I started in Python and used the following code:
    #!/usr/bin/env python
    # File: n-gram.py

    def N_Gram(N, text):
        NList = []  # start with an empty list
        if N > 1:
            space = " " * (N - 1)        # add N - 1 spaces
            text = space + text + space  # add both in front and back
        # append the slices [i:i+N] to NList
        for i in range(len(text) - (N - 1)):
            NList.append(text[i:i+N])
        return NList  # return the list

    # test code
    for i in range(5):
        print(N_Gram(i + 1, "text"))

    # more test code
    nList = N_Gram(7, "Here is a lot of text to print")
    for ngram in iter(nList):
        print('"' + ngram + '"')
http://www.daniweb.com/software-development/python/threads/39109/generating-n-grams-from-a-word
But it produces the character n-grams within a word, whereas I want n-grams across words, such as CYSTIC and FIBROSIS (unigrams) or CYSTIC FIBROSIS (a bigram). Can someone help me out as to how I can get this done?
First, we need to generate word pairs from the existing sentence while maintaining their original order. Such pairs are called bigrams. The NLTK library for Python includes a bigram function that generates these pairs.
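If NLTK is not available, the same pairing can be sketched in plain Python. The helper name `word_bigrams` is made up for this illustration, and it assumes the words are separated by whitespace:

```python
def word_bigrams(sentence):
    """Return adjacent word pairs, preserving the original word order."""
    words = sentence.split()
    # Pair each word with its successor; zip stops at the shorter sequence.
    return list(zip(words, words[1:]))

pairs = word_bigrams("cystic fibrosis affects children")
print(pairs)
# [('cystic', 'fibrosis'), ('fibrosis', 'affects'), ('affects', 'children')]
```

A sentence with fewer than two words simply yields an empty list.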
An n-gram is simply any contiguous sequence of n tokens (words). For example, given the review text “Absolutely wonderful - silky and sexy and comfortable”, we could break it up into 1-grams: Absolutely, wonderful, silky, and, sexy, and, comfortable.
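A rough sketch of that breakdown, assuming a token is any run of letters or apostrophes, so the standalone dash is dropped. The function names `tokenize` and `word_ngrams` are only for illustration:

```python
import re

def tokenize(text):
    """Split text into word tokens, dropping punctuation such as the dash."""
    # Assumption: a token is a run of letters, optionally with apostrophes.
    return re.findall(r"[A-Za-z']+", text)

def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review = "Absolutely wonderful - silky and sexy and comfortable"
tokens = tokenize(review)
print(word_ngrams(tokens, 1))  # the seven 1-grams listed above
print(word_ngrams(tokens, 2))  # six 2-grams, from 'Absolutely wonderful' to 'and comfortable'
```

When n is larger than the number of tokens, the range is empty and the result is an empty list.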
A short Pythonesque solution from this blog:
    def find_ngrams(input_list, n):
        # zip n shifted views of the list; list() is needed on Python 3, where zip is lazy
        return list(zip(*[input_list[i:] for i in range(n)]))
Usage:
    >>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
    >>> find_ngrams(input_list, 1)
    [('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
    >>> find_ngrams(input_list, 2)
    [('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
    >>> find_ngrams(input_list, 3)
    [('all', 'this', 'happened'), ('this', 'happened', 'more'), ('happened', 'more', 'or'), ('more', 'or', 'less')]
Assuming the input is a string containing space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):
    def ngrams(input, n):
        input = input.split(' ')
        output = []
        for i in range(len(input)-n+1):
            output.append(input[i:i+n])
        return output

    ngrams('a b c d', 2)  # [['a', 'b'], ['b', 'c'], ['c', 'd']]
If you want those joined back into strings, you might call something like:
    [' '.join(x) for x in ngrams('a b c d', 2)]  # ['a b', 'b c', 'c d']
Lastly, that doesn't summarize things into totals, so if your input was 'a a a a', you need to count them up into a dict:
    grams = {}  # maps each n-gram string to its count
    for g in (' '.join(x) for x in ngrams('a a a a', 2)):
        grams.setdefault(g, 0)
        grams[g] += 1
    # grams is now {'a a': 3}
Putting that all together into one final function gives:
    def ngrams(input, n):
        input = input.split(' ')
        output = {}
        for i in range(len(input)-n+1):
            g = ' '.join(input[i:i+n])
            output.setdefault(g, 0)
            output[g] += 1
        return output

    ngrams('a a a a', 2)  # {'a a': 3}
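As a possible variation, the counting can be handed to `collections.Counter` from the standard library. The sketch below is not from the original answer: the name `ngram_counts` is hypothetical, and it assumes you want to ignore case and punctuation, which the function above does not do. The sample is a fragment of the text from the question.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count word-level n-grams in text, ignoring case and punctuation."""
    # Assumption: a word is a run of lowercase letters, digits or apostrophes,
    # so a number like "30,000" would be split at the comma.
    words = re.findall(r"[a-z0-9']+", text.lower())
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

sample = ("Inhaling the mists of salt water can reduce the pus and infection "
          "that fills the airways of cystic fibrosis sufferers")

# Unigrams, bigrams and trigrams with their three most frequent entries.
for n in (1, 2, 3):
    print(n, ngram_counts(sample, n).most_common(3))
```

For a text file, you would read its contents into a string first and pass that in. Note that this sketch also forms n-grams across sentence boundaries; split the text into sentences beforehand if that matters.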