I have a column in one dataframe with city and state names in it:
ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")
ac <- as.data.frame(ac)
I would like to search for the values in ac$ac in another data frame column, df$description, and return the value of column id if there is a match.
dput(df)
structure(list(month = c(202110L, 201910L, 202005L, 201703L,
201208L, 201502L), id = c(100559687L, 100558763L, 100558934L,
100558946L, 100543422L, 100547618L), description = c("residential local telephone service local with more san francisco ca flat rate with eas package plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95",
"digital video programming service multilatino ultra bensalem pa service includes digital economy multilatino digital preferred tier and certain additonal digital channels coaxial cable transmission",
"residential all distance telephone service unlimited voice only harrisburg pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking coaxial cable transmission",
"residential all distance telephone service unlimited voice only pittsburgh pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking",
"local spot advertising 30 second advertisement austin tx weekday 6 am 6 pm other audience demographic w18 49 number of rating points for daypart 0 29 average cpp 125",
"residential public switched toll interstate manhattan ks ks plan area residence switched toll base period average revenue per minute 0 18 minute online"
)), row.names = c(1L, 1245L, 3800L, 10538L, 20362L, 50000L), class = "data.frame")
I have tried to do this by accessing the row indexes of the matches via the following methods:
which(ac$ac %in% df$description)
--this returns integer(0)
grep(ac$ac, df$description, value = FALSE)
--this returns the first index, 1, but it isn't vectorized
str_detect(string = ac$ac, pattern = df$description)
--this returns all FALSE, which is incorrect.
My question: how do I search for ac$ac in df$description and return the corresponding value of df$id in the event of a match? Note that the vectors are not of the same length. I am looking for ALL matches, not just the first. I would prefer something simple and fast, because the actual datasets that I will be using have over 100k rows each, but any suggestions or ideas are welcome. Thanks.
Edit: Due to Andre's initial answer below, the title of the question was changed to account for the change in the scope of the question.
Edit (12/7): bounty added to generate additional interest and a fast, efficient scalable solution.
Edit (12/8): Clarification--I would like to be able to add the id variable from df to the ac dataframe, as in ac$id.
The simplest solutions are usually the fastest! Here is my suggestion:
str = paste0(ac, collapse="|")
df$id[grep(str, df$description)]
But you can also do it this way:
df$id[as.logical(rowSums(!is.na(sapply(ac, function(x) stringr::str_match(df$description, x)))))]
Or this way:
df$id[grepl(str, df$description, perl=T)]
However, the approaches need to be compared. By the way, I added the suggestions from @Andre Wildberg and @Martina C. Arnolda. Below is the benchmark.
library(stringi)    # for stri_detect_regex (fWildberg2)
library(tidyverse)  # for %>%, filter, select and str_detect (fArnolda3)
str = paste0(ac, collapse="|")
fFiolka1 = function() df$id[grep(str, df$description)]
fFiolka2 = function() df$id[as.logical(rowSums(!is.na(sapply(ac, function(x) stringr::str_match(df$description, x)))))]
fFiolka3 = function() df$id[grepl(str, df$description, perl=T)]
fWildberg1 = function() df$id[unlist(sapply(ac, function(x) grep(x, df$description)))]
fWildberg2 = function() df$id[as.logical(rowSums(sapply(ac, function(x) stri_detect_regex(df$description, x))))]
fArnolda1 = function() df[grep(str, df$description), ]["id"]
fArnolda2 = function() df[stringi::stri_detect_regex(df$description, str), ]["id"]
fArnolda3 = function() df %>% filter(description %>% str_detect(str)) %>% select(id)
library(microbenchmark)
ggplot2::autoplot(microbenchmark(
fFiolka1(), fFiolka2(), fFiolka3(),
fWildberg1(), fWildberg2(),
fArnolda1(), fArnolda2(), fArnolda3(),
times=100))
Note: for the sake of simplicity I left ac as a vector!
ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")
OK. Now I understand better what you want to achieve. However, in order to fully show the best solution, I have slightly modified your data. Here it is:
library(tidyverse)
ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")
ac = tibble(ac = ac)
df = structure(list(
month = c(202110L, 201910L, 202005L, 201703L, 201208L, 201502L),
id = c(100559687L, 100558763L, 100558934L, 100558946L, 100543422L, 100547618L),
description = c(
"residential local telephone pittsburgh pa local with more san francisco ca flat rate with eas philadelphia pa plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95",
"digital video san francisco ca pittsburgh pa multilatino ultra bensalem pa service includes digital economy multilatino digital preferred tier and certain additonal digital channels coaxial cable transmission",
"residential all distance telephone pittsburgh pa unlimited voice only harrisburg pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking coaxial cable transmission",
"residential all distance telephone pittsburgh pa unlimited voice philadelphia pa san francisco ca pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking",
"local spot advertising 30 second advertisement austin tx weekday 6 am 6 pm other audience demographic w18 49 number of rating points for daypart 0 29 average cpp 125",
"residential public switched toll pittsburgh pa manhattan ks ks plan area residence switched toll base san philadelphia pa ca average revenue per minute 0 18 minute online"
)), row.names = c(1L, 1245L, 3800L, 10538L, 20362L, 50000L), class = "data.frame")
Below you will find four different solutions: one based on a for loop, two based on functions from the dplyr package, and one using a function from the collapse package.
fSolition1 = function(){
id = vector("list", nrow(ac))
for(i in seq_along(ac$ac)){
id[[i]] = df$id[grep(ac$ac[i], df$description)]
}
ac %>% mutate(id = id) %>% unnest(id)
}
fSolition1()
fSolition2 = function(){
ac %>% group_by(ac) %>%
mutate(id = list(df$id[grep(ac, df$description)])) %>%
unnest(id)
}
fSolition2()
fSolition3 = function(){
ac %>% rowwise(ac) %>%
mutate(id = list(df$id[grep(ac, df$description)])) %>%
unnest(id)
}
fSolition3()
fSolition4 = function(){
ac %>%
collapse::ftransform(id = lapply(ac, function(x) df$id[grep(x, df$description)])) %>%
unnest(id)
}
fSolition4()
Note that for the given data, all functions return the following table as a result:
# A tibble: 12 x 2
ac id
<chr> <int>
1 san francisco ca 100559687
2 san francisco ca 100558763
3 san francisco ca 100558946
4 pittsburgh pa 100559687
5 pittsburgh pa 100558763
6 pittsburgh pa 100558934
7 pittsburgh pa 100558946
8 pittsburgh pa 100547618
9 philadelphia pa 100559687
10 philadelphia pa 100558946
11 philadelphia pa 100547618
12 manhattan ks 100547618
It's time for a benchmark
library(microbenchmark)
ggplot2::autoplot(microbenchmark(
fSolition1(), fSolition2(), fSolition3(), fSolition4(), times=100))
It is perhaps no surprise to anyone that the collapse-based solution is the fastest. However, second place may be a big surprise: the good old solution based on a for loop is in second place!! Anyone else want to say that for loops are slow?
Operating on plain vectors instead of data frame columns does not change much. See below.
df_ac = ac$ac
df_description = df$description
df_id = df$id
fSolition5 = function(){
id = vector("list", length = length(df_ac))
for(i in seq_along(df_ac)){
id[[i]] = df_id[grep(df_ac[i], df_description)]
}
ac %>% mutate(id = id) %>% unnest(id)
}
fSolition5()
library(microbenchmark)
ggplot2::autoplot(microbenchmark(
fSolition1(), fSolition2(), fSolition3(), fSolition4(), fSolition5(), times=100))
But the combination of for and ftransform can be surprising!!!
fSolition6 = function(){
id = vector("list", nrow(ac))
for(i in seq_along(ac$ac)){
id[[i]] = df$id[grep(ac$ac[i], df$description)]
}
ac %>% collapse::ftransform(id = id) %>% unnest(id)
}
fSolition6()
library(microbenchmark)
ggplot2::autoplot(microbenchmark(
fSolition1(), fSolition2(), fSolition3(), fSolition4(), fSolition5(), fSolition6(), times=100))
Dear jvaleniti, in your question you wrote "I have a column in one dataframe with city and state names" and that the datasets you will be using have over 100k rows. My conclusion is that it is very likely that a given city will appear several times in your description variable. However, in a comment you wrote "I don't want to change the number of rows in ac". So what kind of results do you expect? Let's see what can be done with it.
Solution 1 - we return all id values as a list of vectors
ac %>% collapse::ftransform(id = map(ac, ~df$id[grep(.x, df$description)]))
# # A tibble: 8 x 2
# ac id
# * <chr> <list>
# 1 san francisco ca <int [3]>
# 2 pittsburgh pa <int [5]>
# 3 philadelphia pa <int [3]>
# 4 washington dc <int [0]>
# 5 new york ny <int [0]>
# 6 aliquippa pa <int [0]>
# 7 gainesville fl <int [0]>
# 8 manhattan ks <int [1]>
Solution 2 - we only return the first id
ac %>% collapse::ftransform(id = map_int(ac, ~df$id[grep(.x, df$description)][1]))
# # A tibble: 8 x 2
# ac id
# * <chr> <int>
# 1 san francisco ca 100559687
# 2 pittsburgh pa 100559687
# 3 philadelphia pa 100559687
# 4 washington dc NA
# 5 new york ny NA
# 6 aliquippa pa NA
# 7 gainesville fl NA
# 8 manhattan ks 100547618
Solution 3 - we only return the last id
ac %>%
collapse::ftransform(id = map_int(ac, function(x) {
idx = grep(x, df$description)
ifelse(length(idx)>0, df$id[idx[length(idx)]], NA)}))
# # A tibble: 8 x 2
# ac id
# * <chr> <int>
# 1 san francisco ca 100558946
# 2 pittsburgh pa 100547618
# 3 philadelphia pa 100547618
# 4 washington dc NA
# 5 new york ny NA
# 6 aliquippa pa NA
# 7 gainesville fl NA
# 8 manhattan ks 100547618
Solution 4 - or maybe you would like to choose any id from all the possible matches
ac %>%
collapse::ftransform(id = map_int(ac, function(x) {
idx = grep(x, df$description)
ifelse(length(idx)==0, NA, ifelse(length(idx)==1, df$id[idx], df$id[sample(idx, 1)]))}))
# # A tibble: 8 x 2
# ac id
# * <chr> <int>
# 1 san francisco ca 100558763
# 2 pittsburgh pa 100559687
# 3 philadelphia pa 100547618
# 4 washington dc NA
# 5 new york ny NA
# 6 aliquippa pa NA
# 7 gainesville fl NA
# 8 manhattan ks 100547618
Solution 5 - if you happen to want to see all the ids and keep the number of rows in ac at the same time
ac %>%
collapse::ftransform(id = map(ac, function(x) {
idx = grep(x, df$description)
if(length(idx)==0) tibble(id = NA, idn = "id1") else tibble(
id = df$id[idx],
idn = paste0("id",1:length(id)))})) %>%
unnest(id) %>%
pivot_wider(ac, names_from = idn, values_from = id)
# # A tibble: 8 x 6
# ac id1 id2 id3 id4 id5
# <chr> <int> <int> <int> <int> <int>
# 1 san francisco ca 100559687 100558763 100558946 NA NA
# 2 pittsburgh pa 100559687 100558763 100558934 100558946 100547618
# 3 philadelphia pa 100559687 100558946 100547618 NA NA
# 4 washington dc NA NA NA NA NA
# 5 new york ny NA NA NA NA NA
# 6 aliquippa pa NA NA NA NA NA
# 7 gainesville fl NA NA NA NA NA
# 8 manhattan ks 100547618 NA NA NA NA
Unfortunately, your description does not indicate which of the above five solutions is acceptable for you. You will have to decide for yourself.
Perhaps this is an option?
ac$id <- sapply(ac$ac, function(x) df$id[grep(x, df$description)])
# ac id
# 1 san francisco ca 100559687
# 2 pittsburgh pa 100558946
# 3 philadelphia pa
# 4 washington dc
# 5 new york ny
# 6 aliquippa pa
# 7 gainesville fl
# 8 manhattan ks 100547618
Try this sapply with grep.
df$id[ unlist( sapply( ac$ac, function(x) grep(x, df$description ) ) ) ]
[1] 100559687 100558946 100547618
EDIT, try stri_detect_regex from stringi. It should be 2-5 times faster.
library(stringi)
df$id[ as.logical( rowSums( sapply( ac$ac, function(x)
stri_detect_regex( df$description, x ) ) ) ) ]
[1] 100559687 100558946 100547618
Microbenchmark on an extended data set with 1.728M rows:
Memory should not be a problem unless you are using a system with less than 4 GB of RAM in total.
nrow(df)
[1] 1728000
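The answer does not show how this extended data set was built; a plausible sketch, simply replicating the six example rows (an assumption, not necessarily the construction used here):
# Hypothetical: blow the 6-row example up to 1,728,000 rows by replicating it 288,000 times
df <- df[rep(seq_len(nrow(df)), times = 288000), ]
rownames(df) <- NULL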
library(microbenchmark)
microbenchmark(
"grep1" = { res <- sapply(ac$ac, function(x) df$id[grep(x, df$description)]) },
"grep2" = { res <- df$id[ unlist( sapply( ac$ac, function(x) grep(x, df$description ) ) ) ] },
"stringi" = { res <- df$id[ as.logical( rowSums( sapply( ac$ac, function(x) stri_detect_regex( df$description, x ) ) ) ) ] }, times=10 )
Unit: seconds
expr min lq mean median uq max neval cld
grep1 96.90757 97.98706 100.13299 99.05837 101.99050 107.04312 10 b
grep2 97.51382 97.66425 100.00610 99.20753 101.17921 106.86661 10 b
stringi 46.15548 46.65894 48.68073 47.29635 50.15713 53.50351 10 a
Memory footprint during microbenchmark:
Path: /Library/Frameworks/R.framework/Versions/4.0/Resources/bin/exec/R
Physical footprint: 638.3M
Physical footprint (peak): 1.8G
You can use regex_inner_join from the fuzzyjoin package:
> library(fuzzyjoin)
> regex_inner_join(df, ac, by = c(description = "ac"))
month id
1 202110 100559687
2 201703 100558946
3 201502 100547618
description
1 residential local telephone service local with more san francisco ca flat rate with eas package plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95
2 residential all distance telephone service unlimited voice only pittsburgh pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking
3 residential public switched toll interstate manhattan ks ks plan area residence switched toll base period average revenue per minute 0 18 minute online
ac
1 san francisco ca
2 pittsburgh pa
3 manhattan ks
Checking using a regular expression and non-expensive functions should be fast:
First, we generate the pattern to be checked: ac_regex <- paste(ac$ac, collapse = "|").
There are several ways to detect matches in description and subset. Here are three:
# 1 grep()
df[grep(ac_regex, df$description), ]["id"]
# 2 stringi::stri_detect_*()
df[stringi::stri_detect_regex(df$description, ac_regex), ]["id"]
# 3 stringr::str_detect() + tidy subsetting
df %>% filter(description %>% str_detect(ac_regex)) %>% select(id)
All three return the desired subset of df:
id
1 100559687
2 100558946
3 100547618
(You need the packages stringi and tidyverse for options 2 and 3, respectively.)
Let's benchmark (using the bench package):
bench::mark(
base_grep = df[grep(ac_regex, df$description), ]["id"],
base_stringi = df[stringi::stri_detect_regex(df$description, ac_regex), ]["id"],
tidy = df %>% filter(description %>% str_detect(ac_regex)) %>% select(id),
check = F
)
expression median
<bch:expr> <bch:tm>
1 base_grep 146.61µs
2 base_stringi 119.6µs
3 tidy 1.99ms
I'd go with stringi!