Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to count the number of matched strings in R, when the string pattern to match is a column from another dataframe?

Tags:

r

I have got two extremely large dataframes, the first data frame consists of a column body, which is a list of comments and the second one consists of names. I want to count how many elements in body contain each element of names. Here's a small reproducible dataset (the original dataset has about 2000 names, where each name is a name of the car):

df1 <- tibble(body = c("The Tesla Roadster has a range of 620 miles",
                       "ferrari needs to make an electric car",
                       "How much does a tesla cost?",
                       "When is the new Mercedes releasing?",
                       "Can't wait to get my hands on the new Tesla"))

df2 <- tibble(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))

As mentioned above, I am trying to count the number of times each of the values in names occur in body and then preferably, I want to add it as a column in df2. I have tried it in the following way:

counter = c()
for (i in df2$names) {
  counter[i] = sum(ifelse(str_detect(df1$body, i),1, 0))
}

While this method works, it takes extremely long amounts of time and returns a vector where the names are attributes of the counter values, I then unstack it and join the dataframe to df2 using names as keys. This is the only method that works, apart from that I have tried using str_count but with my current proficiency in R, the code was absolutely horrible and got me nowhere.

Is there a more efficient way of finding the matching strings? I have tried to find similar questions on stack but to no avail!

Many Thanks in advance :)

like image 581
catic_ Avatar asked Dec 02 '22 09:12

catic_


2 Answers

Something like this?

df1 <- data.frame(body = c("The Tesla Roadster has a range of 620 miles",
                       "ferrari needs to make an electric car",
                       "How much does a tesla cost?",
                       "When is the new Mercedes releasing?",
                       "Can't wait to get my hands on the new Tesla"))
df2 <- data.frame(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))

library(tidyverse)            
df2 %>%
  mutate(des_count = map_int(tolower(names), ~ sum(str_detect(tolower(df1$body), .x))))
#>      names des_count
#> 1     FORD         0
#> 2    TESLA         3
#> 3 MERCEDES         1
#> 4  FERRARI         1
#> 5   JAGUAR         0
#> 6  HYUNDAI         0

Created on 2021-05-13 by the reprex package (v2.0.0)

Or if you want to use baseR

df1 <- data.frame(body = c("The Tesla Roadster has a range of 620 miles",
                       "ferrari needs to make an electric car",
                       "How much does a tesla cost?",
                       "When is the new Mercedes releasing?",
                       "Can't wait to get my hands on the new Tesla"))
df2 <- data.frame(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))

df2$desired_count <- sapply(df2$names, function(x) sum(grepl(x, df1$body, ignore.case = T)))

df2
#>      names desired_count
#> 1     FORD             0
#> 2    TESLA             3
#> 3 MERCEDES             1
#> 4  FERRARI             1
#> 5   JAGUAR             0
#> 6  HYUNDAI             0

Created on 2021-05-13 by the reprex package (v2.0.0)

like image 165
AnilGoyal Avatar answered Jan 04 '23 08:01

AnilGoyal


You could use rowwise and grepl, which I think i a bit faster than str_detect:

df1 <- df1 %>%
  mutate(body = tolower(body))

df2 %>%
  mutate(names = tolower(names)) %>%
  rowwise() %>%
  mutate(counter = sum(grepl(names,tolower(df1$body),fixed = TRUE )))

# A tibble: 6 x 2
# Rowwise: 
  names    counter
  <chr>      <int>
1 ford           0
2 tesla          3
3 mercedes       1
4 ferrari        1
5 jaguar         0
6 hyundai        0

As the question is about speed, here is the benchmark:

df1 <- df1 %>%
  mutate(body = tolower(body))
df2 <- df2 %>%
  mutate(names = tolower(names)) 

anilgoyal = function(){
  df2 %>%
    mutate(des_count = map_int(names, ~ sum(str_detect(df1$body, .x))))
}

anigoyal2 = function(){
  sapply(df2$names, function(x) sum(grepl(x, df1$body, ignore.case = T)))
}

denis = function(){
  df2 %>%
    rowwise() %>%
    mutate(counter = sum(grepl(names,df1$body ,fixed = T)))
}

Anoushiravan = function(){
  df1 %>%
    rowwise() %>%
    mutate(match = df2$names[which(str_detect(body, fixed(df2$names, 
                                                          ignore_case = TRUE)))]) -> df3
 
  df2 %>%
    mutate(cnt = map_chr(names, ~ sum(str_detect(df3$match, .x))))
}

chris = function(){
  df2 %>%
    rowwise() %>%
    mutate(count = sum(grepl(paste0("(?i)", names), df1$body)))
}

The results

library(microbenchmark)

microbenchmark(denis(),anilgoyal(),anigoyal2(),Anoushiravan(),chris(),times = 100)

Unit: microseconds
           expr     min       lq      mean   median       uq      max neval  cld
        denis()  5960.6  7059.85 10644.711  8692.50 11533.90  49709.7   100   c 
    anilgoyal()  3614.2  4385.55  6660.244  4886.60  7195.65  31088.9   100  b  
    anigoyal2()   153.4   203.00   315.966   239.35   285.45   2010.8   100 a   
 Anoushiravan() 10083.4 12522.40 19994.135 15355.85 20469.60 100866.2   100    d
        chris()  5971.7  7060.55 11353.754  8356.35 10727.10  98319.3   100   c 

Base R is much more efficient ! Bravo @AnilGoyal

like image 30
denis Avatar answered Jan 04 '23 07:01

denis