
How do I count the number of sentences, words and characters in a file?

Tags: python, nltk

I have written the following code to tokenize the input paragraph from the file samp.txt. Can anybody help me find and print the number of sentences, words and characters in the file? I am using NLTK in Python for this.

>>> import nltk
>>> with open('samp.txt') as f:
...     raw = f.read()
...
>>> tokenized_sentences = nltk.sent_tokenize(raw)
>>> for each_sentence in tokenized_sentences:
...     words = nltk.word_tokenize(each_sentence)
...     print(each_sentence)   # prints tokenized sentences from samp.txt
...
>>> tokenized_words = nltk.word_tokenize(raw)
>>> for each_word in tokenized_words:
...     print(each_word)       # prints tokenized words from samp.txt
aks asked Feb 22 '11 05:02




3 Answers

Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath):

import nltk

# dirpath must already hold the path of the directory containing the text file
folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, r'.*\.txt')

print("The number of sentences =", len(corpusReader.sents()))
print("The number of paragraphs =", len(corpusReader.paras()))
print("The number of words =", len([word for sentence in corpusReader.sents() for word in sentence]))
# Counts the characters inside tokens only, so whitespace is not included
print("The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word]))

Hope this helps
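
One caveat on the character count: it only sees characters inside tokens, so whitespace is left out. A minimal standard-library sketch of the difference, using a made-up sample string instead of samp.txt (the `sample` text and the helper name `char_counts` are assumptions for illustration, not part of NLTK):

```python
def char_counts(text):
    """Return (all characters, non-whitespace characters) for text."""
    total = len(text)
    non_space = sum(1 for ch in text if not ch.isspace())
    return total, non_space

# Made-up sample text standing in for the contents of samp.txt
sample = "Hello world. How are you?"
total, non_space = char_counts(sample)
print("Characters including whitespace =", total)
print("Characters excluding whitespace =", non_space)
```

Which of the two numbers counts as "the number of characters" depends on what you need; `len(raw)` on the file contents gives the first one directly.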

inspectorG4dget answered Sep 28 '22 02:09


With nltk, you can also use FreqDist (see the O'Reilly book, Ch. 3.1).

And in your case:

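import nltk

# Read the file as UTF-8 text
with open('samp.txt', encoding='utf-8') as f:
    raw = f.read()
tokens = nltk.Text(nltk.word_tokenize(raw))
fdist = nltk.FreqDist(tokens)
print(fdist.N())  # total number of tokens (punctuation tokens included)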
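
Keep in mind that `fdist.N()` is the total number of tokens, and the tokenizer emits punctuation as tokens too. As a rough standard-library analogue (in NLTK 3, FreqDist builds on `collections.Counter`), the sketch below uses a hand-written token list, which is an assumption standing in for the output of `word_tokenize`:

```python
from collections import Counter

# Hypothetical token list, standing in for nltk.word_tokenize output
tokens = ["the", "cat", "saw", "the", "dog", "."]

counts = Counter(tokens)
total_tokens = sum(counts.values())   # analogous to fdist.N()
distinct_tokens = len(counts)         # analogous to fdist.B()
# Keep only tokens that contain at least one letter or digit
word_tokens = sum(n for tok, n in counts.items() if any(c.isalnum() for c in tok))

print("Total tokens =", total_tokens)
print("Distinct tokens =", distinct_tokens)
print("Tokens excluding pure punctuation =", word_tokens)
```

If you want a word count that ignores punctuation, filtering the tokens as in the last step is one simple option.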
TheIdealis answered Sep 28 '22 01:09


For what it's worth, in case someone comes across this later: I think this covers everything the OP asked. With the textstat package, counting sentences and characters is very easy. Note that the punctuation at the end of each sentence matters for the sentence count.

import textstat

your_text = "This is a sentence! This is sentence two. And this is the final sentence?"
print("Num sentences:", textstat.sentence_count(your_text))
print("Num chars:", textstat.char_count(your_text, ignore_spaces=True))
print("Num words:", len(your_text.split()))
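
To see why the end-of-sentence punctuation matters, here is a naive standard-library sketch that counts sentences by their terminal punctuation. It is not a description of how textstat works internally; the helper `naive_sentence_count` and the unpunctuated sample are hypothetical, for illustration only:

```python
import re

def naive_sentence_count(text):
    """Count runs of '.', '!' or '?' followed by whitespace or end of text."""
    return len(re.findall(r"[.!?]+(?=\s|$)", text))

with_punct = "This is a sentence! This is sentence two. And this is the final sentence?"
without_punct = "This is a sentence This is sentence two And this is the final sentence"

print("With punctuation:", naive_sentence_count(with_punct))
print("Without punctuation:", naive_sentence_count(without_punct))
```

A counter like this has nothing to go on once the terminal punctuation is missing, which is why unpunctuated text tends to give misleading sentence counts.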
salvu answered Sep 28 '22 00:09