I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written code to input my files into the program.
The input is 300 .txt files written in English, and I want the output in the form of n-grams and, in particular, their frequency counts.
I know that NLTK has bigram and trigram modules: http://www.nltk.org/_modules/nltk/model/ngram.html
but I am not advanced enough to work them into my program.
input: txt files NOT single sentences
output example:
Bigram: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]
Trigram: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]
My code up to now is:
from nltk.corpus import PlaintextCorpusReader

corpus = 'C:/Users/jack3/My folder'
files = PlaintextCorpusReader(corpus, '.*')
ngrams = 2

def generate(file, ngrams):
    for gram in range(0, ngrams):
        print((file[0:-4] + "_" + str(ngrams) + "_grams.txt").replace("/", "_"))

for file in files.fileids():
    generate(file, ngrams)

Any help with what should be done next?
In natural language processing, an n-gram is a contiguous sequence of n words. For example, “Python” is a unigram (n = 1), “Data Science” is a bigram (n = 2), “Natural language processing” is a trigram (n = 3), and so on. Here our focus will be on implementing unigram (single-word) models in Python.
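To make the definition concrete, here is a minimal sketch of n-gram extraction as a sliding window over a list of tokens. The helper name `sliding_ngrams` is hypothetical and the tokens are assumed to be split already; it is an illustration of the idea, not a replacement for a library tokenizer.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n staggered views of the list; zip stops at the shortest
    # view, so a list shorter than n yields no n-grams.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["Hi", "How", "are", "you"]
print(sliding_ngrams(tokens, 1))  # unigrams
print(sliding_ngrams(tokens, 2))  # bigrams
print(sliding_ngrams(tokens, 3))  # trigrams
```

A list of k tokens gives k - n + 1 n-grams, so four tokens produce three bigrams and two trigrams.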
Repeat counts fall as n grows: many words repeat as unigrams, fewer word pairs repeat as bigrams, and fewer still repeat as trigrams.
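One way to check this on your own text is to count, for each n, how many distinct n-grams occur more than once. This is a small sketch that assumes whitespace tokenization and a made-up sample sentence; `repeated_ngram_types` is a hypothetical helper name.

```python
from collections import Counter


def repeated_ngram_types(tokens, n):
    """Count the distinct n-grams that appear at least twice in tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(grams)
    return sum(1 for c in counts.values() if c > 1)


# Made-up sample sentence, split on whitespace for simplicity.
tokens = "the cat sat on the mat and the cat slept".split()
repeats = {n: repeated_ngram_types(tokens, n) for n in (1, 2, 3)}
print(repeats)
```

In this sentence two words repeat ("the" and "cat"), one bigram repeats ("the cat"), and no trigram does.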
Sentiment analysis of bigrams/trigrams: n-gram analyses are often used to see which words tend to show up together. I often like to investigate combinations of two or three words, i.e., bigrams/trigrams. An n-gram is a contiguous sequence of n items from a given sample of text or speech.
Just use nltk.ngrams.
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter

text = ("I need to write a program in NLTK that breaks a corpus (a large collection of "
        "txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. "
        "I need to write a program in NLTK that breaks a corpus")

# word_tokenize relies on NLTK's Punkt tokenizer data, fetched via nltk.download()
token = nltk.word_tokenize(text)
bigrams = ngrams(token, 2)
trigrams = ngrams(token, 3)
fourgrams = ngrams(token, 4)
fivegrams = ngrams(token, 5)

print(Counter(bigrams))

Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2, ('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2, ('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2, ('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1, ('unigrams', ','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1, ('trigrams', ','): 1, (',', 'bigrams'): 1, ('large', 'collection'): 1, ('bigrams', ','): 1, ('of', 'txt'): 1, (')', 'into'): 1, ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1, ('(', 'a'): 1, (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1, ('collection', 'of'): 1, ('files', ')'): 1})
UPDATE (with pure python):
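import os
from collections import Counter

import nltk
from nltk.util import ngrams

corpus = []
path = '.'
# next(os.walk(path))[2] lists the files directly inside `path`
for i in next(os.walk(path))[2]:
    if i.endswith('.txt'):
        # `with` closes each file once it has been read
        with open(os.path.join(path, i)) as f:
            corpus.append(f.read())

frequencies = Counter()
for text in corpus:
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token, 2)
    frequencies += Counter(bigrams)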
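To get all five orders the question asks for, the same loop can be run once per n. The following is only a sketch under stated assumptions: the files are UTF-8, a simple regular-expression tokenizer stands in for nltk.word_tokenize so the example runs without NLTK, and the names `tokenize` and `count_corpus_ngrams` are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a token is either a run of word characters or a single
# punctuation character. Swap in a real tokenizer if you need one.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def count_corpus_ngrams(folder, max_n=5):
    """Map each n in 1..max_n to a Counter of n-gram frequencies,
    summed over every .txt file directly inside folder."""
    totals = {n: Counter() for n in range(1, max_n + 1)}
    for path in sorted(Path(folder).glob("*.txt")):
        tokens = tokenize(path.read_text(encoding="utf-8"))
        for n, counter in totals.items():
            # n-grams are built per file, so none spans two files.
            counter.update(zip(*(tokens[i:] for i in range(n))))
    return totals
```

With this in place, `count_corpus_ngrams(folder)[2].most_common(10)` would list the ten most frequent bigrams in the folder, and index 5 would hold the fivegram counts. Keeping one Counter per n makes it easy to write each order to its own output file later.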