How to calculate the number of occurrences of a given character in each row of a column of strings?


The stringr package provides the str_count function, which seems to do what you're interested in:

# Load your example data
q.data <- data.frame(number = 1:3, string = c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
library(stringr)

# Count the number of 'a's in each element of string
q.data$number.of.a <- str_count(q.data$string, "a")
q.data
#  number     string number.of.a
#1      1 greatgreat           2
#2      2      magic           1
#3      3        not           0

If you don't want to leave base R, here's a fairly succinct and expressive possibility:

x <- q.data$string
lengths(regmatches(x, gregexpr("a", x)))
# [1] 2 1 0
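One thing to watch with this approach: the pattern is treated as a regular expression, so a character such as "." would match every character. A minimal sketch, assuming you want a literal match, passes fixed = TRUE to gregexpr (the vector y is a made-up example, not from the question):

```r
# Hypothetical example strings containing a regex metacharacter
y <- c("a.b.c", "no dots", "...")

# Without fixed = TRUE, "." is a regex that matches any character
regex_counts <- lengths(regmatches(y, gregexpr(".", y)))

# With fixed = TRUE, "." is matched literally
literal_counts <- lengths(regmatches(y, gregexpr(".", y, fixed = TRUE)))

regex_counts
# [1] 5 7 3
literal_counts
# [1] 2 0 3
```

Escaping the character in the pattern ("\\.") should work as well; fixed = TRUE just avoids having to remember which characters are special.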

nchar(as.character(q.data$string)) - nchar(gsub("a", "", q.data$string))
# [1] 2 1 0

Notice that I coerce the factor variable to character, before passing to nchar. The regex functions appear to do that internally.
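To illustrate why the coercion matters, here is a small sketch, assuming a factor column like the one data.frame() created by default before R 4.0 (the names f, s, res and counts are made up for the example):

```r
# Hypothetical factor column
f <- factor(c("greatgreat", "magic", "not"))

# nchar() refuses a factor, so capture the error instead of stopping
res <- tryCatch(nchar(f), error = function(e) "error")
res
# [1] "error"

# Coercing once up front gives the expected counts
s <- as.character(f)
counts <- nchar(s) - nchar(gsub("a", "", s, fixed = TRUE))
counts
# [1] 2 1 0
```

Coercing once and reusing s also avoids converting the column twice.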

Here are benchmark results (with the test data scaled up to 3000 rows):

 q.data<-q.data[rep(1:NROW(q.data), 1000),]
 str(q.data)
'data.frame':   3000 obs. of  3 variables:
 $ number     : int  1 2 3 1 2 3 1 2 3 1 ...
 $ string     : Factor w/ 3 levels "greatgreat","magic",..: 1 2 3 1 2 3 1 2 3 1 ...
 $ number.of.a: int  2 1 0 2 1 0 2 1 0 2 ...

library(rbenchmark)  # provides benchmark()

benchmark(
  Dason = { q.data$number.of.a <- str_count(as.character(q.data$string), "a") },
  Tim = { resT <- sapply(as.character(q.data$string), function(x, letter = "a") {
            sum(unlist(strsplit(x, split = "")) == letter) }) },
  DWin = { resW <- nchar(as.character(q.data$string)) - nchar(gsub("a", "", q.data$string)) },
  # Note: this entry searches for "g" rather than "a"
  Josh = { x <- sapply(regmatches(q.data$string, gregexpr("g", q.data$string)), length) },
  replications = 100)
#-----------------------
   test replications elapsed  relative user.self sys.self user.child sys.child
1 Dason          100   4.173  9.959427     2.985    1.204          0         0
3  DWin          100   0.419  1.000000     0.417    0.003          0         0
4  Josh          100  18.635 44.474940    17.883    0.827          0         0
2   Tim          100   3.705  8.842482     3.646    0.072          0         0

The stringi package provides the functions stri_count and stri_count_fixed, which are very fast.

stringi::stri_count(q.data$string, fixed = "a")
# [1] 2 1 0

benchmark

Compared with the fastest approach from @42-'s answer and with the equivalent function from the stringr package, for a vector with 30,000 elements:

library(microbenchmark)
library(stringr)   # for str_count
library(ggplot2)   # for autoplot

benchmark <- microbenchmark(
  stringi = stringi::stri_count(test.data$string, fixed = "a"),
  baseR = nchar(test.data$string) - nchar(gsub("a", "", test.data$string, fixed = TRUE)),
  stringr = str_count(test.data$string, "a")
)

autoplot(benchmark)

data

q.data <- data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
test.data <- q.data[rep(1:NROW(q.data), 10000),]


Another good option, using charToRaw:

sum(charToRaw("abc.d.aa") == charToRaw('.'))
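The line above handles a single string. To apply it to a whole column, one possible sketch wraps it in vapply. The helper name count_byte is hypothetical, and the approach assumes the character being counted is a single byte (e.g. ASCII), since charToRaw compares bytes rather than characters:

```r
# Count a single-byte character in each element of a character vector.
# count_byte is a hypothetical helper name, not from any package.
count_byte <- function(strings, ch) {
  target <- charToRaw(ch)
  stopifnot(length(target) == 1L)  # only valid for single-byte characters
  vapply(strings, function(s) sum(charToRaw(s) == target), integer(1),
         USE.NAMES = FALSE)
}

count_byte(c("greatgreat", "magic", "not"), "a")
# [1] 2 1 0
count_byte("abc.d.aa", ".")
# [1] 2
```

Because the comparison is on raw bytes, no regex escaping is needed for characters like ".".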

A variation of https://stackoverflow.com/a/12430764/589165 is

nchar(gsub("[^a]", "", q.data$string))
# [1] 2 1 0


Tags: regex, dataframe, r
