How do I count the number of sentences, words and characters in a file?

Tags:

python

nltk

I have written the following code to tokenize the paragraph read from the file samp.txt. Can anybody help me find and print the number of sentences, words and characters in the file? I am using NLTK in Python for this.

>>> import nltk.data
>>> import nltk.tokenize
>>> f = open('samp.txt')
>>> raw = f.read()
>>> tokenized_sentences = nltk.sent_tokenize(raw)
>>> for each_sentence in tokenized_sentences:
...     words = nltk.tokenize.word_tokenize(each_sentence)
...     print(each_sentence)   # prints tokenized sentences from samp.txt
>>> tokenized_words = nltk.word_tokenize(raw)
>>> for each_word in tokenized_words:
...     words = nltk.tokenize.word_tokenize(each_word)
...     print(each_word)       # prints tokenized words from samp.txt


asked Feb 22 '11 05:02

aks

3 Answers

Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath):

import nltk

# dirpath must already hold the path of the directory containing the text file
folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, r'.*\.txt')

print("The number of sentences =", len(corpusReader.sents()))
print("The number of paragraphs =", len(corpusReader.paras()))
print("The number of words =", len([word for sentence in corpusReader.sents() for word in sentence]))
print("The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word]))
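
Note that the character count above only adds up the characters inside tokens, so whitespace is not counted. A small sketch of the difference, using only the standard library and `str.split` as a stand-in for the NLTK tokenizer (the sample string and variable names are made up for illustration):

```python
# Stand-in text; in practice this would be the contents of samp.txt.
raw = "Hello world.\nThis is a test."

# Characters inside tokens only: spaces and newlines are skipped.
chars_in_tokens = sum(len(token) for token in raw.split())

# Every character in the text, including spaces and newlines.
chars_total = len(raw)

print("Characters in tokens =", chars_in_tokens)
print("Characters in total  =", chars_total)
```

If you want the count to include whitespace, `len(raw)` on the raw file contents is probably the simplest option.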

Hope this helps


answered Sep 28 '22 02:09

inspectorG4dget

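With NLTK, you can also use FreqDist (see O'Reilly's NLTK book, Ch. 3.1). Its N() method returns the total number of tokens counted, which gives you the word count.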

And in your case:

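import nltk
# Python 3: open() decodes the file, so no separate .decode('utf-8') is needed
raw = open('samp.txt', encoding='utf-8').read()
text = nltk.Text(nltk.word_tokenize(raw))
fdist = nltk.FreqDist(text)
print(fdist.N())  # total number of word tokens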
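
As far as I know, `FreqDist` behaves like a `collections.Counter` with some extras, and `N()` is the total number of tokens counted. The rough standard-library sketch below shows the same idea. It splits on whitespace instead of calling `nltk.word_tokenize`, so punctuation stays attached to words and the numbers can differ from NLTK's. The sample text is made up.

```python
from collections import Counter

# Stand-in text; in practice this would be read from samp.txt.
raw = "the cat sat on the mat and the dog sat too"

# Whitespace split is an assumption here, not NLTK's tokenizer.
tokens = raw.split()
counts = Counter(tokens)

total_tokens = sum(counts.values())   # comparable to fdist.N()
distinct_tokens = len(counts)         # number of different words

print("Total words =", total_tokens)
print("Distinct words =", distinct_tokens)
print("Most common =", counts.most_common(1))
```

The total gives the word count the question asks for; the distinct count is a side benefit of keeping a frequency table.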

answered Sep 28 '22 01:09

TheIdealis

For what it's worth, in case someone comes along later: I think this covers everything the OP asked. With the textstat package, counting sentences and characters is very easy. Note that the sentence count depends on each sentence ending with punctuation.

import textstat

your_text = "This is a sentence! This is sentence two. And this is the final sentence?"
print("Num sentences:", textstat.sentence_count(your_text))
print("Num chars:", textstat.char_count(your_text, ignore_spaces=True))
print("Num words:", len(your_text.split()))
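
To see why the closing punctuation matters, here is a naive standard-library sketch that counts sentences by splitting on runs of `.`, `!` or `?`. This is only an illustration, not how textstat works internally, and the helper name `naive_sentence_count` is made up:

```python
import re

def naive_sentence_count(text):
    """Count non-empty chunks separated by ., ! or ?"""
    # Split at runs of terminal punctuation, then drop empty leftovers.
    parts = re.split(r"[.!?]+", text)
    return len([part for part in parts if part.strip()])

with_punct = "This is a sentence! This is sentence two. And this is the final sentence?"
without_punct = "This is a sentence This is sentence two And this is the final sentence"

print("With punctuation:", naive_sentence_count(with_punct))
print("Without punctuation:", naive_sentence_count(without_punct))
```

Without the punctuation the whole text collapses into a single "sentence". A splitter this simple will also miscount abbreviations such as "e.g.", so treat it as a demonstration only.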

answered Sep 28 '22 00:09

salvu
