Generating n-grams (unigrams, bigrams, etc.) from a large corpus of .txt files, with their frequencies


I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written code to input my files into the program.

The input is 300 .txt files written in English, and I want the output as n-grams together with, most importantly, their frequency counts.

I know that NLTK has bigram and trigram modules (http://www.nltk.org/_modules/nltk/model/ngram.html), but I am not experienced enough to work them into my program.

Input: .txt files, NOT single sentences

Output example:

Bigram: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]

Trigram: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]

My code up to now is:

from nltk.corpus import PlaintextCorpusReader

corpus = 'C:/Users/jack3/My folder'
files = PlaintextCorpusReader(corpus, '.*')
ngrams = 2

def generate(file, ngrams):
    for gram in range(0, ngrams):
        print((file[0:-4] + "_" + str(ngrams) + "_grams.txt").replace("/", "_"))

for file in files.fileids():
    generate(file, ngrams)

Any help on what I should do next?


asked Sep 07 '15 15:09

Arash

1 Answer

Just use nltk.ngrams.

import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter

text = "I need to write a program in NLTK that breaks a corpus (a large collection of \
txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams.\
 I need to write a program in NLTK that breaks a corpus"

# Split the text into word and punctuation tokens
token = nltk.word_tokenize(text)

# ngrams() returns a generator, so each one can be consumed only once
bigrams = ngrams(token, 2)
trigrams = ngrams(token, 3)
fourgrams = ngrams(token, 4)
fivegrams = ngrams(token, 5)

print(Counter(bigrams))

Output (the order of entries may differ between Python versions):

Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2,
 ('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2,
 ('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2,
 ('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1,
 ('unigrams', ','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1,
 ('trigrams', ','): 1, (',', 'bigrams'): 1, ('large', 'collection'): 1,
 ('bigrams', ','): 1, ('of', 'txt'): 1, (')', 'into'): 1,
 ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1, ('(', 'a'): 1,
 (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1,
 ('collection', 'of'): 1, ('files', ')'): 1})
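If it helps to see what the n-gram step amounts to, here is a rough standard-library sketch of the same sliding-window idea. The helper name `sliding_ngrams` is made up for illustration, and `str.split` is only a crude stand-in for `word_tokenize`, so treat this as an approximation rather than NLTK's actual implementation.

```python
from collections import Counter


def sliding_ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple (hypothetical helper)."""
    # zip over n staggered views of the list: tokens[0:], tokens[1:], ...
    return zip(*(tokens[i:] for i in range(n)))


# Crude whitespace tokenization, used here only to keep the sketch dependency-free
tokens = "I need to write a program . I need to write".split()

bigram_counts = Counter(sliding_ngrams(tokens, 2))
trigram_counts = Counter(sliding_ngrams(tokens, 3))

# most_common(k) lists the k most frequent n-grams with their counts
print(bigram_counts.most_common(3))
```

`Counter.most_common` is usually the handiest way to read off the frequency table once the counts are in place.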

UPDATE (with pure python):

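import os
from collections import Counter

import nltk
from nltk.util import ngrams

corpus = []
path = '.'
# Read every .txt file in the top level of `path`
for name in next(os.walk(path))[2]:
    if name.endswith('.txt'):
        with open(os.path.join(path, name)) as f:
            corpus.append(f.read())

# Accumulate bigram counts over the whole corpus
frequencies = Counter()
for text in corpus:
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token, 2)
    frequencies += Counter(bigrams)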
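The question asks for unigrams through fivegrams, so the same loop can be repeated for each n. Below is a minimal sketch of that, written with the standard library only so it runs without NLTK data. The names `tokenize`, `make_ngrams` and `count_corpus_ngrams` are hypothetical, and the regex tokenizer is an assumption standing in for `nltk.word_tokenize`; swap the real tokenizer back in for actual use.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def tokenize(text):
    # Stand-in tokenizer (assumption): runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)


def make_ngrams(tokens, n):
    # Every run of n consecutive tokens, as tuples
    return zip(*(tokens[i:] for i in range(n)))


def count_corpus_ngrams(folder, max_n=5):
    """Map each n in 1..max_n to a Counter of n-grams over all .txt files in folder."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for txt_file in sorted(Path(folder).glob("*.txt")):
        # Tokenize per file, so n-grams never span a file boundary
        tokens = tokenize(txt_file.read_text(encoding="utf-8"))
        for n in counts:
            counts[n].update(make_ngrams(tokens, n))
    return counts


# Small demonstration on two throwaway files
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "a.txt").write_text("Hi How are you ?", encoding="utf-8")
    Path(demo_dir, "b.txt").write_text("How are you today", encoding="utf-8")
    demo_counts = count_corpus_ngrams(demo_dir)

print(demo_counts[2].most_common(2))
```

Using `Counter.update` instead of `frequencies += Counter(...)` avoids building a new counter for every file, which may matter with 300 files. Note that unigrams come out as one-element tuples such as `('How',)`.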


answered Sep 21 '22 02:09

hellpanderr

Tags: python, nltk
