I'd like to count occurrences of a defined pattern (here: runs of 'Y') in a string for each row of a data frame. Ideally, I'd like the number of runs in V3 and the length of each run in V4.
Input:
V1 V2
A XXYYYYY
B XXYYXX
C XYXXYX
D XYYXYX
Desired output:
V1 V2 V3 V4
A XXYYYYY 1 5
B XXYYXX 1 2
C XYXXYX 2 1,1
D XYYXYX 2 2,1
I tried different modifications of the function below, with no success.
dict <- setNames(nm=c("Y"))
seqs <- df$V2
sapply(dict, str_count, string=seqs)
Thanks in advance!
Another base R solution, using gregexpr:
df <- data.frame(
  V1 = c("A", "B", "C", "D"),
  V2 = c("XXYYYYY", "XXYYXX", "XYXXYX", "XYYXYX")
)
Extract the match.length attribute of the gregexpr output for each string, then count how many values each attribute holds (which tells you how many matches there are):
r <- gregexpr("Y+", df$V2)
len <- lapply(r, FUN = function(x) attr(x, "match.length"))
df$V3 <- lengths(len)
df$V4 <- len
df
#V1 V2 V3 V4
#1 A XXYYYYY 1 5
#2 B XXYYXX 1 2
#3 C XYXXYX 2 1, 1
#4 D XYYXYX 2 2, 1
If you have an old version of R that doesn't have lengths() yet, you can use df$V3 <- sapply(len, length) instead.
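To make the match.length step above more concrete, here is a small sketch that inspects what gregexpr returns for a single string. The variable names (m, starts, run_len) are only illustrative and are not part of the answer's code.

```r
# Inspect the gregexpr() result for one string
m <- gregexpr("Y+", "XYYXYX")[[1]]

starts  <- as.integer(m)            # start position of each run of Y
run_len <- attr(m, "match.length")  # length of each run of Y

starts   # 2 5
run_len  # 2 1
length(run_len)  # 2 runs in total
```

The number of runs is simply the number of entries in match.length, which is what lengths(len) computes for every row at once.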
And if you need a more generic function that does the same for any character vector x and pattern a:
foo <- function(x, a) {
  ans <- data.frame(x)
  # one gregexpr result per element of x
  r <- gregexpr(a, x)
  # length of every match, taken from the match.length attribute
  len <- lapply(r, FUN = function(z) attr(z, "match.length"))
  ans$quantity <- lengths(len)
  ans$lengths <- len
  ans
}
Try foo(df$V2, 'Y+').
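One caveat worth noting: when a string contains no match at all, gregexpr returns -1 with a match.length of -1, so counting the entries of match.length would report one match of length -1. A possible way around this, sketched below, is to extract the matched substrings with regmatches and measure them with nchar. The name foo_safe is hypothetical and chosen only for this illustration.

```r
# Hypothetical variant of foo() that also handles strings without any match
foo_safe <- function(x, a) {
  x <- as.character(x)
  ans <- data.frame(x)
  # regmatches() yields character(0) for elements where the pattern is absent
  runs <- regmatches(x, gregexpr(a, x))
  ans$quantity <- lengths(runs)        # number of matches per element
  ans$lengths <- lapply(runs, nchar)   # length of each match
  ans
}

res <- foo_safe(c("XXYYYYY", "XYXXYX", "XXXX"), "Y+")
res$quantity  # 1 2 0
```

For the third string the quantity is 0 and the lengths entry is an empty vector, rather than a single -1.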
Here is a stringr solution:
library(stringr)
df <- data.frame(
  V1 = c("A", "B", "C", "D"),
  V2 = c("XXYYYYY", "XXYYXX", "XYXXYX", "XYYXYX")
)
# number of runs of Y in each string
df$V3 <- str_count(df$V2, "Y+")
# length of each run (end - start + 1), joined with commas
df$V4 <- sapply(str_locate_all(df$V2, "Y+"), function(x) {
  paste(x[, 2] - x[, 1] + 1, collapse = ",")
})