
determine frequency of string using grep [duplicate]

Tags: r, frequency

If I have a vector

x <- c("ajjss","acdjfkj","auyjyjjksjj")

and do:

y <- x[grep("jj",x)]
table(y)

I get:

y
      ajjss auyjyjjksjj 
          1           1 

However, the second string, "auyjyjjksjj", contains the substring "jj" twice, so it should be counted twice. How can I change this from a true/false match into an actual count of how often "jj" occurs in each string?

It would also be great if, for each string, the frequency of the substring could be divided by the string's length.

Thanks in advance.

brucezepplin asked Mar 24 '13 16:03


3 Answers

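I solved this using gregexpr():

x <- c("ajjss", "acdjfkj", "auyjyjjksjj")
# gregexpr() returns -1 when there is no match; otherwise it returns one position per match
freq <- sapply(gregexpr("jj", x), function(m) if (m[1] != -1) length(m) else 0)
df <- data.frame(x, freq)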

df
#            x freq
#1       ajjss    1
#2     acdjfkj    0
#3 auyjyjjksjj    2

For the last part of the question, dividing the frequency by the string length:

df$rate <- df$freq / nchar(as.character(df$x))

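The as.character() call is needed because, in R versions before 4.0.0, data.frame(x, freq) converts strings to factors by default unless you specify stringsAsFactors = FALSE. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so the conversion is harmless but no longer required.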

df
#            x freq      rate
#1       ajjss    1 0.2000000
#2     acdjfkj    0 0.0000000
#3 auyjyjjksjj    2 0.1818182
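
The same steps can be wrapped in a small reusable function. This is only a sketch: the name count_substring is hypothetical (it is not part of the answer above), and it assumes the pattern is a literal string rather than a regular expression, hence fixed = TRUE.

```r
# Hypothetical helper: count non-overlapping occurrences of a literal
# substring in each element of a character vector.
count_substring <- function(strings, pattern) {
  matches <- gregexpr(pattern, strings, fixed = TRUE)
  # gregexpr() reports "no match" as -1, so count only positive positions
  vapply(matches, function(m) sum(m > 0), integer(1))
}

x <- c("ajjss", "acdjfkj", "auyjyjjksjj")
freq <- count_substring(x, "jj")

# stringsAsFactors = FALSE keeps x as character, so nchar() works directly
df <- data.frame(x, freq, stringsAsFactors = FALSE)
df$rate <- df$freq / nchar(df$x)
df
```

Using vapply() with integer(1) instead of sapply() guarantees an integer vector even for empty input, and passing stringsAsFactors = FALSE explicitly makes the result independent of the R version.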
ndoogan answered Oct 16 '22 20:10


You're using the wrong tool: grep only tells you which elements contain a match, not how many matches there are. Try gregexpr, which will give you the positions where the search string was found (or -1 if not found):

> gregexpr("jj", x, fixed = TRUE)
[[1]]
[1] 2
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE

[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE

[[3]]
[1]  6 10
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE
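
To turn those positions into counts, one possible approach (a sketch, not part of the original answer) is to pass the gregexpr() result to regmatches() and take the length of each element. The variable names m and counts are just illustrative.

```r
x <- c("ajjss", "acdjfkj", "auyjyjjksjj")
m <- gregexpr("jj", x, fixed = TRUE)

# regmatches() extracts the matched substrings; an element with no match
# yields character(0), so its length is 0
counts <- lengths(regmatches(x, m))
counts
# [1] 1 0 2
```

Note that gregexpr() finds non-overlapping matches, so a string such as "jjj" would count as one occurrence of "jj", not two.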
A5C1D2H2I1M1N2O1R2T1 answered Oct 16 '22 21:10


You can use the qdap package (it is not part of base R, so it has to be installed first):

x <- c("ajjss","acdjfkj","auyjyjjksjj")
library(qdap)
termco(x, seq_along(x), "jj")

## > termco(x, seq_along(x), "jj")
##   x word.count         jj
## 1 1          1 1(100.00%)
## 2 2          1          0
## 3 3          1 2(200.00%)

Note that the output shows both the raw frequency and the frequency relative to the word count. The returned object is actually a list with a custom print method, which is why it displays as a neat table. To access the raw frequencies:

termco(x, seq_along(x), "jj")$raw

## > termco(x, seq_along(x), "jj")$raw
##   x word.count jj
## 1 1          1  1
## 2 2          1  0
## 3 3          1  2
Tyler Rinker answered Oct 16 '22 20:10