
How to count the number of sentences in a text in R?

Tags: r, text-mining

I read a text into R using the readChar() function. I want to test the hypothesis that the sentences of the text contain as many occurrences of the letter "a" as of the letter "b". I recently discovered the {stringr} package, which helped me a great deal with tasks such as counting the number of characters and the total number of occurrences of each letter in the entire text. Now I need to know the number of sentences in the whole text. Does R have a function that can help me do that? Thank you very much!

SavedByJESUS asked Sep 26 '12 08:09



1 Answer

Thank you @gui11aume for your answer. A very good package I just found that can help do the work is {openNLP}. This is the code to do that:

install.packages("openNLP") ## Installs the required natural language processing (NLP) package
install.packages("openNLPmodels.en") ## Installs the model files for the English language
library(openNLP) ## Loads the package for use in the task
library(openNLPmodels.en) ## Loads the model files for the English language

text = "Dr. Brown and Mrs. Theresa will be away from a very long time!!! I can't wait to see them again." ## This sentence has unusual punctuation as suggested by @gui11aume

x = sentDetect(text, language = "en") ## sentDetect() is the function to use. It detects and separates sentences in a text. The first argument is the string vector (or text) and the second argument is the language.
x ## Displays the different sentences in the string vector (or text).

[1] "Dr. Brown and Mrs. Theresa will be away from a very long time!!! "
[2] "I can't wait to see them again."

length(x) ## Displays the number of sentences in the string vector (or text).

[1] 2
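A note for anyone on a newer release of {openNLP}: as far as I know, sentDetect() is no longer exported from version 0.2 onwards, and sentence detection goes through an annotator object together with the {NLP} package instead. The following is only a sketch under that assumption. It needs a working Java installation, count_sentences() is a helper name made up for this example, and the sentence boundaries it finds may differ from the output shown above.

```r
# Assumes the NLP and openNLP packages (plus Java) are installed.
library(NLP)
library(openNLP)

# Hypothetical helper: returns the detected sentences and how many there are.
count_sentences <- function(text) {
  s <- as.String(text)                        # NLP string class, can be subset by spans
  annotator <- Maxent_Sent_Token_Annotator()  # the default language is English
  spans <- NLP::annotate(s, annotator)        # one annotation per detected sentence
  list(sentences = s[spans], n = length(spans))
}

text <- "Dr. Brown and Mrs. Theresa will be away from a very long time!!! I can't wait to see them again."
result <- count_sentences(text)
result$sentences  # character vector with one element per sentence
result$n          # number of sentences
```

Calling NLP::annotate() with the namespace prefix avoids a clash with other packages that also define an annotate() function.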

The {openNLP} package is really great for natural language processing in R. You can find a good, short intro to it here, or you can check out the package's documentation here.

Three more languages are supported in the package. You just need to install and load the corresponding model files.

  1. {openNLPmodels.es} for Spanish
  2. {openNLPmodels.ge} for German
  3. {openNLPmodels.th} for Thai
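
Once the sentences are in a character vector, the per-sentence counts of "a" and "b" needed for the hypothesis in the question can be computed directly. The sketch below uses base R only and copies the two sentences from the example output above. count_letter() is a name invented for this example, and counting case-insensitively is an assumption about what is wanted. With {stringr} loaded, str_count() should give the same kind of result.

```r
# Sentences copied from the sentence detection output shown earlier.
sentences <- c(
  "Dr. Brown and Mrs. Theresa will be away from a very long time!!! ",
  "I can't wait to see them again."
)

# Hypothetical helper: occurrences of `letter` in each element of `x`,
# ignoring case.
count_letter <- function(x, letter) {
  matches <- gregexpr(letter, tolower(x), fixed = TRUE)
  # gregexpr() returns -1 when nothing matches, so count positive positions only.
  vapply(matches, function(m) sum(m > 0), integer(1))
}

# One row per sentence, one column per letter.
counts <- data.frame(
  a = count_letter(sentences, "a"),
  b = count_letter(sentences, "b")
)
counts
nrow(counts)  # number of sentences
```

Each row pairs the "a" count and the "b" count of the same sentence, so the two columns can be compared sentence by sentence.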
SavedByJESUS answered Nov 03 '22 22:11