Is there a function to count the number of words in a string? For example:
str1 <- "How many words are in this sentence"
to return a result of 7.
Use the regular expression symbol \\W to match non-word characters, with + to indicate one or more in a row, along with gregexpr to find all matches in a string. The number of words is then the number of word separators plus 1.
lengths(gregexpr("\\W+", str1)) + 1
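# [1] 7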
This will fail with blank strings at the beginning or end of the character vector, and when a "word" doesn't satisfy \\W's notion of non-word (one could work with other regular expressions, \\S+, [[:alpha:]], etc., but there will always be edge cases with a regex approach). It is likely more efficient than strsplit solutions, which allocate memory for each word. Regular expressions are described in ?regex.
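To put that efficiency claim on a rough footing, one could time both approaches; the sketch below assumes the microbenchmark package is installed, and the test string and repetition count are arbitrary.
library(microbenchmark)
# 10,000 copies of a short sentence as a stand-in for "bigger" data
big <- rep("the quick brown fox jumps over the lazy dog", 10000)
microbenchmark(
  gregexpr = lengths(gregexpr("\\W+", big)) + 1,
  strsplit = lengths(strsplit(big, "\\W+")),
  times = 10
)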
Update: As noted in the comments and in a different answer by @Andri, the approach fails with empty (zero-word) and one-word strings, and with trailing punctuation:
str1 = c("", "x", "x y", "x y!" , "x y! z")
lengths(gregexpr("[A-z]\\W+", str1)) + 1L
# [1] 2 2 2 3 3
Many of the other answers also fail in these or similar cases (e.g., multiple spaces). I think the caveat about the 'notion of a word' in the original answer covers problems with punctuation (solution: choose a different regular expression, e.g., [[:space:]]+), but the zero- and one-word cases are a real problem; @Andri's solution fails to distinguish between zero and one words. So, taking a 'positive' approach to finding words, one might
sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
Leading to
sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
# [1] 0 1 2 2 3
Again the regular expression might be refined for different notions of 'word'.
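For example, one illustrative refinement (not from the original answer) treats embedded apostrophes and hyphens as part of a word, so that "don't" and "well-known" each count once:
# str2 is hypothetical example data for this sketch
str2 <- c("don't stop", "a well-known fact")
sapply(gregexpr("[[:alpha:]]+(['-][[:alpha:]]+)*", str2), function(x) sum(x > 0))
# [1] 2 3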
I like the use of gregexpr() because it's memory efficient. An alternative using strsplit() (like @user813966, but with a regular expression to delimit words) and making use of the original notion of delimiting words is
lengths(strsplit(str1, "\\W+"))
# [1] 0 1 2 2 3
This needs to allocate new memory for each word created, and for the intermediate list of words. That could be relatively expensive when the data is 'big', but for most purposes it is effective and understandable.
The simplest way would be:
require(stringr)
str_count("one, two three 4,,,, 5 6", "\\S+")
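# [1] 6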
... counting all sequences of non-space characters (\\S+).
But what about a little function that lets us also decide which kind of words we would like to count and which works on whole vectors as well?
require(stringr)
nwords <- function(string, pseudo = FALSE) {
  # pseudo = TRUE counts any run of non-space characters as a word;
  # otherwise only runs of alphabetic characters are counted
  pattern <- ifelse(pseudo, "\\S+", "[[:alpha:]]+")
  str_count(string, pattern)
}
nwords("one, two three 4,,,, 5 6")
# 3
nwords("one, two three 4,,,, 5 6", pseudo = TRUE)
# 6
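And because str_count() is vectorised over its string argument, nwords() works on whole character vectors, e.g.:
nwords(c("", "one", "one two", "one two!"))
# [1] 0 1 2 2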