determine frequency of string using grep [duplicate]

Question

If I have a vector

x <- c("ajjss","acdjfkj","auyjyjjksjj")

and do:

y <- x[grep("jj",x)]
table(y)

I get:

y
      ajjss auyjyjjksjj 
          1           1

However, the second string "auyjyjjksjj" contains the substring "jj" twice, so it should count twice. How can I change this from a true/false match into an actual count of how often "jj" occurs in each string?

It would also be great if, for each string, the frequency of the substring could be divided by the string's length.

Thanks in advance.

ndoogan · Accepted Answer

I solved this using gregexpr():

x <- c("ajjss", "acdjfkj", "auyjyjjksjj")
# gregexpr() returns -1 for a string with no match, otherwise the match positions
freq <- sapply(gregexpr("jj", x), function(m) if (m[1] != -1) length(m) else 0)
df <- data.frame(x, freq)

df
#            x freq
#1       ajjss    1
#2     acdjfkj    0
#3 auyjyjjksjj    2
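As a possible variation on the answer above, the explicit check for -1 can be avoided by extracting the matches with regmatches() and counting them with lengths(). This is only a sketch using base R; fixed = TRUE is an added assumption that "jj" should be treated as a literal string rather than a regular expression.

```r
x <- c("ajjss", "acdjfkj", "auyjyjjksjj")

# gregexpr() locates every non-overlapping match; regmatches() extracts them.
# A string with no match yields character(0), so its length is 0.
matches <- regmatches(x, gregexpr("jj", x, fixed = TRUE))
freq <- lengths(matches)

freq
# [1] 1 0 2
```

Note that lengths() requires R 3.2.0 or later.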

For the last part of the question, dividing the frequency by the string length:

df$rate <- df$freq / nchar(as.character(df$x))

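The as.character() call is needed in R versions before 4.0.0, where data.frame(x, freq) converts strings to factors unless you specify stringsAsFactors = FALSE. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so df$x stays a character vector and the conversion is harmless.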

df
#            x freq      rate
#1       ajjss    1 0.2000000
#2     acdjfkj    0 0.0000000
#3 auyjyjjksjj    2 0.1818182
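The two steps above could be wrapped into one function, roughly as sketched here. The name substring_rate and its arguments are hypothetical, not part of the original answer, and the pattern is assumed to be a literal string (fixed = TRUE).

```r
# Hypothetical helper: count a literal substring in each string and
# divide the count by the string's length.
substring_rate <- function(strings, pattern) {
  strings <- as.character(strings)  # tolerate factor input
  hits <- gregexpr(pattern, strings, fixed = TRUE)
  # Positions are > 0 for real matches and -1 when nothing matched
  freq <- vapply(hits, function(pos) sum(pos > 0), integer(1))
  data.frame(x = strings, freq = freq, rate = freq / nchar(strings),
             stringsAsFactors = FALSE)
}

res <- substring_rate(c("ajjss", "acdjfkj", "auyjyjjksjj"), "jj")
res$freq
# [1] 1 0 2
```

Setting stringsAsFactors = FALSE explicitly keeps the behaviour the same across R versions. An empty string would give a rate of 0/0, which is NaN, so such inputs may need separate handling.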

A5C1D2H2I1M1N2O1R2T1 · Answer

You're using the wrong tool: grep only tells you which elements contain a match. Try gregexpr, which gives you the positions where the search string was found in each element (or -1 if it was not found):

> gregexpr("jj", x, fixed = TRUE)
[[1]]
[1] 2
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE

[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE

[[3]]
[1]  6 10
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE
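To go from that list of positions to counts, one option is to drop the -1 placeholders and count what remains. The following is a minimal sketch; the variable names are illustrative only. It also shows that gregexpr reports non-overlapping matches, which matters when the search string can overlap with itself.

```r
x <- c("ajjss", "acdjfkj", "auyjyjjksjj")
pos <- gregexpr("jj", x, fixed = TRUE)

# Keep only real match positions (drop the -1 "not found" marker)
starts <- lapply(pos, function(p) as.integer(p[p > 0]))
counts <- lengths(starts)

counts
# [1] 1 0 2

# Matches do not overlap: "jjjj" contains "jj" at positions 1 and 3 only,
# so it counts as 2 rather than 3.
overlap_check <- as.integer(gregexpr("jj", "jjjj", fixed = TRUE)[[1]])
overlap_check
# [1] 1 3
```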

Tyler Rinker · Answer

You can use the qdap package (not part of the base R installation):

x <- c("ajjss","acdjfkj","auyjyjjksjj")
library(qdap)
termco(x, seq_along(x), "jj")

## > termco(x, seq_along(x), "jj")
##   x word.count         jj
## 1 1          1 1(100.00%)
## 2 2          1          0
## 3 3          1 2(200.00%)

Note that the output shows the frequency and, in parentheses, the frequency relative to the word count. The returned object is actually a list that prints in a formatted way. To access the raw frequencies:

termco(x, seq_along(x), "jj")$raw

## > termco(x, seq_along(x), "jj")$raw
##   x word.count jj
## 1 1          1  1
## 2 2          1  0
## 3 3          1  2

Tags: r, frequency

brucezepplin
