I have got two extremely large dataframes, the first data frame consists of a column body
, which is a list of comments and the second one consists of names
. I want to count how many elements in body
contain each element of names
. Here's a small reproducible dataset (the original dataset has about 2000 names, where each name is a name of the car):
df1 <- tibble(body = c("The Tesla Roadster has a range of 620 miles",
"ferrari needs to make an electric car",
"How much does a tesla cost?",
"When is the new Mercedes releasing?",
"Can't wait to get my hands on the new Tesla"))
df2 <- tibble(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))
As mentioned above, I am trying to count the number of times each of the values in names occur in body and then preferably, I want to add it as a column in df2
. I have tried it in the following way:
counter = c()
for (i in df2$names) {
counter[i] = sum(ifelse(str_detect(df1$body, i),1, 0))
}
While this method works, it takes extremely long amounts of time and returns a vector where the names are attributes of the counter
values, I then unstack it and join the dataframe to df2
using names
as keys.
This is the only method that works, apart from that I have tried using str_count
but with my current proficiency in R, the code was absolutely horrible and got me nowhere.
Is there a more efficient way of finding the matching strings? I have tried to find similar questions on stack but to no avail!
Many Thanks in advance :)
Something like this?
df1 <- data.frame(body = c("The Tesla Roadster has a range of 620 miles",
"ferrari needs to make an electric car",
"How much does a tesla cost?",
"When is the new Mercedes releasing?",
"Can't wait to get my hands on the new Tesla"))
df2 <- data.frame(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))
library(tidyverse)
df2 %>%
mutate(des_count = map_int(tolower(names), ~ sum(str_detect(tolower(df1$body), .x))))
#> names des_count
#> 1 FORD 0
#> 2 TESLA 3
#> 3 MERCEDES 1
#> 4 FERRARI 1
#> 5 JAGUAR 0
#> 6 HYUNDAI 0
Created on 2021-05-13 by the reprex package (v2.0.0)
Or if you want to use baseR
df1 <- data.frame(body = c("The Tesla Roadster has a range of 620 miles",
"ferrari needs to make an electric car",
"How much does a tesla cost?",
"When is the new Mercedes releasing?",
"Can't wait to get my hands on the new Tesla"))
df2 <- data.frame(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))
df2$desired_count <- sapply(df2$names, function(x) sum(grepl(x, df1$body, ignore.case = T)))
df2
#> names desired_count
#> 1 FORD 0
#> 2 TESLA 3
#> 3 MERCEDES 1
#> 4 FERRARI 1
#> 5 JAGUAR 0
#> 6 HYUNDAI 0
Created on 2021-05-13 by the reprex package (v2.0.0)
You could use rowwise
and grepl
, which I think i a bit faster than str_detect
:
df1 <- df1 %>%
mutate(body = tolower(body))
df2 %>%
mutate(names = tolower(names)) %>%
rowwise() %>%
mutate(counter = sum(grepl(names,tolower(df1$body),fixed = TRUE )))
# A tibble: 6 x 2
# Rowwise:
names counter
<chr> <int>
1 ford 0
2 tesla 3
3 mercedes 1
4 ferrari 1
5 jaguar 0
6 hyundai 0
As the question is about speed, here is the benchmark:
df1 <- df1 %>%
mutate(body = tolower(body))
df2 <- df2 %>%
mutate(names = tolower(names))
anilgoyal = function(){
df2 %>%
mutate(des_count = map_int(names, ~ sum(str_detect(df1$body, .x))))
}
anigoyal2 = function(){
sapply(df2$names, function(x) sum(grepl(x, df1$body, ignore.case = T)))
}
denis = function(){
df2 %>%
rowwise() %>%
mutate(counter = sum(grepl(names,df1$body ,fixed = T)))
}
Anoushiravan = function(){
df1 %>%
rowwise() %>%
mutate(match = df2$names[which(str_detect(body, fixed(df2$names,
ignore_case = TRUE)))]) -> df3
df2 %>%
mutate(cnt = map_chr(names, ~ sum(str_detect(df3$match, .x))))
}
chris = function(){
df2 %>%
rowwise() %>%
mutate(count = sum(grepl(paste0("(?i)", names), df1$body)))
}
The results
library(microbenchmark)
microbenchmark(denis(),anilgoyal(),anigoyal2(),Anoushiravan(),chris(),times = 100)
Unit: microseconds
expr min lq mean median uq max neval cld
denis() 5960.6 7059.85 10644.711 8692.50 11533.90 49709.7 100 c
anilgoyal() 3614.2 4385.55 6660.244 4886.60 7195.65 31088.9 100 b
anigoyal2() 153.4 203.00 315.966 239.35 285.45 2010.8 100 a
Anoushiravan() 10083.4 12522.40 19994.135 15355.85 20469.60 100866.2 100 d
chris() 5971.7 7060.55 11353.754 8356.35 10727.10 98319.3 100 c
Base R is much more efficient ! Bravo @AnilGoyal
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With