
Generating N-grams (unigrams, bigrams, etc.) from a large corpus of .txt files, with their frequencies




I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written code to input my files into the program.

The input is 300 .txt files written in English, and I want the output in the form of n-grams and, especially, their frequency counts.

I know that NLTK has Bigram and Trigram modules (http://www.nltk.org/_modules/nltk/model/ngram.html), but I am not advanced enough to work them into my program.

Input: .txt files, NOT single sentences

output example:

Bigram: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]

Trigram: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]

My code up to now is:

    from nltk.corpus import PlaintextCorpusReader

    corpus = 'C:/Users/jack3/My folder'
    files = PlaintextCorpusReader(corpus, '.*')
    ngrams = 2

    def generate(file, ngrams):
        # So far this only prints the intended output file name on each pass
        for gram in range(0, ngrams):
            print((file[0:-4] + "_" + str(ngrams) + "_grams.txt").replace("/", "_"))

    for file in files.fileids():
        generate(file, ngrams)

Any help on what should be done next?

Arash asked Sep 07 '15 15:09



1 Answer

Just use nltk.ngrams.

    import nltk
    from nltk import word_tokenize
    from nltk.util import ngrams
    from collections import Counter

    text = ("I need to write a program in NLTK that breaks a corpus (a large collection of "
            "txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. "
            "I need to write a program in NLTK that breaks a corpus")
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token, 2)
    trigrams = ngrams(token, 3)
    fourgrams = ngrams(token, 4)
    fivegrams = ngrams(token, 5)

    print(Counter(bigrams))

    # Output (the order of entries with equal counts may differ):
    Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2,
    ('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2,
    ('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2,
    ('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1, ('unigrams',
    ','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1, ('trigrams', ','): 1,
    (',', 'bigrams'): 1, ('large', 'collection'): 1, ('bigrams', ','): 1, ('of',
    'txt'): 1, (')', 'into'): 1, ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1,
    ('(', 'a'): 1, (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1,
    ('collection', 'of'): 1, ('files', ')'): 1})
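For intuition about what an n-gram generator produces, the sliding window can be sketched with the standard library alone. This is a minimal sketch and not NLTK's implementation: `sliding_ngrams` and `count_all_orders` are hypothetical helper names, and the token list is the example from the question.

```python
from collections import Counter
from itertools import islice


def sliding_ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens (hypothetical helper)."""
    tokens = list(tokens)
    # zip stops at the shortest slice, so len(tokens) - n + 1 tuples come out
    # (or none at all when the list is shorter than n).
    return zip(*(islice(tokens, i, None) for i in range(n)))


def count_all_orders(tokens, max_n=5):
    """Return {n: Counter of n-grams} for n = 1 .. max_n (hypothetical helper)."""
    tokens = list(tokens)
    return {n: Counter(sliding_ngrams(tokens, n)) for n in range(1, max_n + 1)}


tokens = ['Hi', 'How', 'are', 'you', '?', 'i', 'am', 'fine', 'and', 'you']
counts = count_all_orders(tokens)
print(counts[1][('you',)])  # 'you' occurs twice in the token list, so this prints 2
```

Each entry of `counts` is an ordinary `Counter`, so `counts[2].most_common(10)` would give the ten most frequent bigrams.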

UPDATE (with pure python):

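    import os
    from collections import Counter

    import nltk
    from nltk.util import ngrams

    corpus = []
    path = '.'
    # next(os.walk(path))[2] lists the file names directly inside `path`
    for name in next(os.walk(path))[2]:
        if name.endswith('.txt'):
            with open(os.path.join(path, name)) as f:
                corpus.append(f.read())

    frequencies = Counter()
    for text in corpus:
        token = nltk.word_tokenize(text)
        bigrams = ngrams(token, 2)
        frequencies += Counter(bigrams)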
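The loop above counts only bigrams. One way to extend it to unigrams through fivegrams, with a separate frequency table per n, is sketched below. To keep the sketch runnable without NLTK or its tokenizer data, it assumes a simple regex tokenizer in place of `nltk.word_tokenize`, so its tokens may differ from NLTK's. The names `tokenize`, `count_folder`, and `max_n` are hypothetical, and UTF-8 is an assumption about the files' encoding.

```python
import os
import re
from collections import Counter

# Stand-in tokenizer (assumption): runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_RE.findall(text)


def count_folder(folder, max_n=5):
    """Count 1- to max_n-grams over every .txt file directly inside folder."""
    frequencies = {n: Counter() for n in range(1, max_n + 1)}
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        if not (name.endswith('.txt') and os.path.isfile(full_path)):
            continue
        with open(full_path, encoding='utf-8') as f:
            tokens = tokenize(f.read())
        for n, counter in frequencies.items():
            # Slide a window of width n over this file's tokens;
            # n-grams never span two files.
            counter.update(
                tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
            )
    return frequencies
```

Calling `count_folder` on the corpus directory returns a dictionary keyed by n, so `freqs[3].most_common(20)` would list the twenty most frequent trigrams. Counting per file rather than over one concatenated string is a design choice here: it avoids n-grams that join the end of one document to the start of the next.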
hellpanderr answered Sep 21 '22 02:09
