Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R how to speed up pattern matching using vectors

I have a column in one dataframe with city and state names in it:

ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")

ac <- as.data.frame(ac)

I would like to search for the values in ac$ac in another data frame column, d$description and return the value of column id if there is a match.

dput(df)
structure(list(month = c(202110L, 201910L, 202005L, 201703L, 
201208L, 201502L), id = c(100559687L, 100558763L, 100558934L, 
100558946L, 100543422L, 100547618L), description = c("residential local telephone service local with more san francisco ca flat rate with eas package plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95", 
"digital video programming service multilatino ultra bensalem pa service includes digital economy multilatino digital preferred tier and certain additonal digital channels coaxial cable transmission", 
"residential all distance telephone service  unlimited  voice only harrisburg pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking coaxial cable transmission", 
"residential all distance telephone service  unlimited voice only pittsburgh pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking", 
"local spot advertising 30 second advertisement austin tx weekday 6 am 6 pm other audience demographic w18 49 number of rating points for daypart 0 29 average cpp 125", 
"residential public switched toll interstate manhattan ks ks plan area residence switched toll base period average revenue per minute 0 18 minute online"
)), row.names = c(1L, 1245L, 3800L, 10538L, 20362L, 50000L), class = "data.frame")

I have tried to do this via accessing the row indexes of the matches via the following methods:

  1. which(ac$ac %in% df$description)--this returns integer(0).
  2. grep(ac$ac, df$description, value = FALSE)--this returns the first index, 1. But this isn't vectorized.
  3. str_detect(string = ac$ac, pattern = df$description) -- but this returns all FALSE which is incorrect.

My question: how do I search for ac$ac in df$description and return the corresponding value of df$id in the event of a match? Note that the vectors are not of the same length. I am looking for ALL matches, not just the first. I would prefer something simple and fast, because the actual datasets that I will be using have over 100k rows each but any suggestions or ideas are welcome. Thanks.

Edit. Due to Andre's initial answer below, the name of the question was changed to account for the change in the scope of the question.

Edit (12/7): bounty added to generate additional interest and a fast, efficient scalable solution.

Edit (12/8): Clarification--I would like to be able to add the id variable from df to the ac dataframe, as in ac$id.

like image 745
jvalenti Avatar asked Dec 05 '21 17:12

jvalenti


People also ask

How do you use pattern matching in R?

R Functions for Pattern Matching 1. Finding strings: grep. grep(pattern, string) returns by default a list of indices. If the regular expression, pattern, matches a particular element in the vector string, it returns the element's index. For returning the actual matching element values, set the option value to TRUE by value=TRUE. Example:

How to return the matching element values in a vector string?

grep (pattern, string) returns by default a list of indices. If the regular expression, pattern, matches a particular element in the vector string, it returns the element's index. For returning the actual matching element values, set the option value to TRUE by value=TRUE. 2.

What is the use of no match in R?

match() function in R Language is used to return the positions of the first match of the elements of the first vector in the second vector. If the element is not found, it returns NA. Syntax: match(x1, x2, nomatch) Parameters: x1: Vector 1 x2: Vector 2 nomatch: value to be returned in case of no match Example 1:

How to compare two r vectors in R?

R Match – Using match () and %in% to compare vectors. Today we’re going to discuss how to compare two R vectors for the elements (values) which they have in common. We have two options here: the %in% operator – returns a vector of True / False results which indicates if a value in the first vector was present in the second.


Video Answer


5 Answers

The simplest solutions are usually the fastest! Here is my suggestion:

str = paste0(ac, collapse="|")
df$id[grep(str, df$description)]

But you can also this way

df$id[as.logical(rowSums(!is.na(sapply(ac, function(x) stringr::str_match(df$description, x)))))]

Or this way

df$id[grepl(str, df$description, perl=T)]

However, it has to be compared. By the way, I added suggestions from @Andre Wildberg and @Martina C. Arnolda. Below is the Benchmark.

str = paste0(ac, collapse="|")
fFiolka1 = function() df$id[grep(str, df$description)]
fFiolka2 = function() df$id[as.logical(rowSums(!is.na(sapply(ac, function(x) stringr::str_match(df$description, x)))))]
fFiolka3 = function() df$id[grepl(str, df$description, perl=T)]

fWildberg1 = function() df$id[unlist(sapply(ac, function(x) grep(x, df$description)))]
fWildberg2 = function() df$id[as.logical(rowSums(sapply(ac, function(x) stri_detect_regex(df$description, x))))]

fArnolda1 = function() df[grep(str, df$description), ]["id"]
fArnolda2 = function() df[stringi::stri_detect_regex(df$description, str), ]["id"]
fArnolda3 = function() df %>% filter(description %>% str_detect(str)) %>% select(id)

library(microbenchmark)
ggplot2::autoplot(microbenchmark(
  fFiolka1(), fFiolka2(), fFiolka3(),
  fWildberg1(), fWildberg2(),
  fArnolda1(), fArnolda2(), fArnolda3(),
  times=100))

enter image description here

Note, for the sake of simplicity I left ac as a vector !.

ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")

Special update for @jvalenti

OKAY. Now I understand better what you want to achieve. However, in order to fully show the best solution, I have slightly modified your data. Here they are

library(tidyverse)

ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")
ac = tibble(ac = ac)

df = structure(list(
  month = c(202110L, 201910L, 202005L, 201703L, 201208L, 201502L), 
  id = c(100559687L, 100558763L, 100558934L, 100558946L, 100543422L, 100547618L), 
  description = c(
    "residential local telephone pittsburgh pa local with more san francisco ca flat rate with eas philadelphia pa plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95",
    "digital video san francisco ca pittsburgh pa  multilatino ultra bensalem pa service includes digital economy multilatino digital preferred tier and certain additonal digital channels coaxial cable transmission",
    "residential all distance telephone pittsburgh pa unlimited voice only harrisburg pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking coaxial cable transmission",
    "residential all distance telephone pittsburgh pa unlimited voice philadelphia pa san francisco ca pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking",
    "local spot advertising 30 second advertisement austin tx weekday 6 am 6 pm other audience demographic w18 49 number of rating points for daypart 0 29 average cpp 125",
    "residential public switched toll pittsburgh pa manhattan ks ks plan area residence switched toll base san philadelphia pa ca average revenue per minute 0 18 minute online"
  )), row.names = c(1L, 1245L, 3800L, 10538L, 20362L, 50000L), class = "data.frame")

Below you will find four different solutions. One based on the for loop, two solutions based on the functions from the dplyr package, and yet a function from the collapse package.

fSolition1 = function(){
  id = vector("list", nrow(ac))
  for(i in seq_along(ac$ac)){
    id[[i]] = df$id[grep(ac$ac[i], df$description)]
  }
  ac %>% mutate(id = id) %>% unnest(id)
}
fSolition1()

fSolition2 = function(){
  ac %>% group_by(ac) %>% 
  mutate(id = list(df$id[grep(ac, df$description)])) %>% 
  unnest(id)
}
fSolition2()

fSolition3 = function(){
  ac %>% rowwise(ac) %>% 
  mutate(id = list(df$id[grep(ac, df$description)])) %>% 
  unnest(id)
}
fSolition3()

fSolition4 = function(){
ac %>%  
  collapse::ftransform(id = lapply(ac, function(x) df$id[grep(x, df$description)])) %>% 
  unnest(id)
}
fSolition4()

Note that for the given data, all functions that return the following table as a result

# A tibble: 12 x 2
   ac                      id
   <chr>                <int>
 1 san francisco ca 100559687
 2 san francisco ca 100558763
 3 san francisco ca 100558946
 4 pittsburgh pa    100559687
 5 pittsburgh pa    100558763
 6 pittsburgh pa    100558934
 7 pittsburgh pa    100558946
 8 pittsburgh pa    100547618
 9 philadelphia pa  100559687
10 philadelphia pa  100558946
11 philadelphia pa  100547618
12 manhattan ks     100547618

It's time for a benchmark


library(microbenchmark)
ggplot2::autoplot(microbenchmark(
  fSolition1(), fSolition2(), fSolition3(), fSolition4(), times=100))

enter image description here

It is perhaps no surprise to anyone that the collapse based solution is the fastest. However, second place may be a big surprise. The good old solution based on the for function is in second place!! Anyone else want to say that for is slow?

Special update for @Gwang-Jin Kim

The actions on vectors did not change much. Look below.

df_ac = ac$ac
df_decription = df$description
df_id = df$id
fSolition5 = function(){
  id = vector("list", length = length(df_ac))
  for(i in seq_along(df_ac)){
    id[[i]] = df_id[grep(df_ac[i], df_decription)]
  }
  ac %>% mutate(id = id) %>% unnest(id)
}
fSolition5()

library(microbenchmark)
ggplot2::autoplot(microbenchmark(
  fSolition1(), fSolition2(), fSolition3(), fSolition4(), fSolition5(), times=100))

enter image description here

But the combination of for and ftransform can be surprising !!!

fSolition6 = function(){
  id = vector("list", nrow(ac))
  for(i in seq_along(ac$ac)){
    id[[i]] = df$id[grep(ac$ac[i], df$description)]
  }
  ac %>% collapse::ftransform(id = id) %>% unnest(id)
}
fSolition6()

library(microbenchmark)
ggplot2::autoplot(microbenchmark(
  fSolition1(), fSolition2(), fSolition3(), fSolition4(), fSolition5(), fSolition6(), times=100))

enter image description here

Last update for @jvalenti

Dear jvaleniti, in your question you wrote I have a column in one dataframe with city and state names and then I will be using have over 100k rows. My conclusion is that it is very likely that a given city will appear several times in your variable description.

However, in the comment you wrote I don't want to change the number of rows in ac So what kind of results do you expect? Let's see what can be done with it.

Solution 1 - we return all id as a list of vectors

ac %>% collapse::ftransform(id = map(ac, ~df$id[grep(.x, df$description)])) 
# # A tibble: 8 x 2
# ac               id       
# * <chr>            <list>   
#   1 san francisco ca <int [3]>
#   2 pittsburgh pa    <int [5]>
#   3 philadelphia pa  <int [3]>
#   4 washington dc    <int [0]>
#   5 new york ny      <int [0]>
#   6 aliquippa pa     <int [0]>
#   7 gainesville fl   <int [0]>
#   8 manhattan ks     <int [1]>

Solution 2 - we only return the first id

ac %>% collapse::ftransform(id = map_int(ac, ~df$id[grep(.x, df$description)][1])) 
# # A tibble: 8 x 2
# ac                      id
# * <chr>                <int>
# 1 san francisco ca 100559687
# 2 pittsburgh pa    100559687
# 3 philadelphia pa  100559687
# 4 washington dc           NA
# 5 new york ny             NA
# 6 aliquippa pa            NA
# 7 gainesville fl          NA
# 8 manhattan ks     100547618

Solution 3 - we only return the last id

ac %>%
  collapse::ftransform(id = map_int(ac, function(x) {
    idx = grep(x, df$description)
    ifelse(length(idx)>0, df$id[idx[length(idx)]], NA)})) 
# # A tibble: 8 x 2
# ac                      id
# * <chr>                <int>
# 1 san francisco ca 100558946
# 2 pittsburgh pa    100547618
# 3 philadelphia pa  100547618
# 4 washington dc           NA
# 5 new york ny             NA
# 6 aliquippa pa            NA
# 7 gainesville fl          NA
# 8 manhattan ks     100547618

Solution 4 - or maybe you would like to choose any id from all possible

ac %>%
  collapse::ftransform(id = map_int(ac, function(x) {
    idx = grep(x, df$description)
    ifelse(length(idx)==0, NA, ifelse(length(idx)==1, df$id[idx], df$id[sample(idx, 1)]))})) 
# # A tibble: 8 x 2
# ac                      id
# * <chr>                <int>
# 1 san francisco ca 100558763
# 2 pittsburgh pa    100559687
# 3 philadelphia pa  100547618
# 4 washington dc           NA
# 5 new york ny             NA
# 6 aliquippa pa            NA
# 7 gainesville fl          NA
# 8 manhattan ks     100547618

Solution 5 - if you accidentally wanted to see all the id's and wanted to keep the number of ac lines at the same time

ac %>%
  collapse::ftransform(id = map(ac, function(x) {
    idx = grep(x, df$description)
    if(length(idx)==0) tibble(id = NA, idn = "id1") else tibble(
      id = df$id[idx],
      idn = paste0("id",1:length(id)))})) %>% 
  unnest(id) %>% 
  pivot_wider(ac, names_from = idn, values_from = id)
# # A tibble: 8 x 6
# ac                     id1       id2       id3       id4       id5
# <chr>                <int>     <int>     <int>     <int>     <int>
# 1 san francisco ca 100559687 100558763 100558946        NA        NA
# 2 pittsburgh pa    100559687 100558763 100558934 100558946 100547618
# 3 philadelphia pa  100559687 100558946 100547618        NA        NA
# 4 washington dc           NA        NA        NA        NA        NA
# 5 new york ny             NA        NA        NA        NA        NA
# 6 aliquippa pa            NA        NA        NA        NA        NA
# 7 gainesville fl          NA        NA        NA        NA        NA
# 8 manhattan ks     100547618        NA        NA        NA        NA    

Unfortunately, the description provided by you does not indicate which of the above five solutions is an acceptable solution for you. You will have to decide for yourself.

like image 152
Marek Fiołka Avatar answered Oct 22 '22 23:10

Marek Fiołka


Perhaps this is an option?

ac$id <- sapply(ac$ac, function(x) d$id[grep(x, d$description)])
#                 ac        id
# 1 san francisco ca 100559687
# 2    pittsburgh pa 100558946
# 3  philadelphia pa          
# 4    washington dc          
# 5      new york ny          
# 6     aliquippa pa          
# 7   gainesville fl          
# 8     manhattan ks 100547618
like image 24
Wimpel Avatar answered Oct 23 '22 01:10

Wimpel


Try this sapply with grep.

df$id[ unlist( sapply( ac$ac, function(x) grep(x, df$description ) ) ) ]
[1] 100559687 100558946 100547618

EDIT, try stri_detect_regex from stringi. Should be 2-5 times faster.

library(stringi)

df$id[ as.logical( rowSums( sapply( ac$ac, function(x) 
  stri_detect_regex( df$description, x ) ) ) ) ]
[1] 100559687 100558946 100547618

Microbenchmark on an extended data set with 1.728M rows:
Memory should not be a problem unless you are using a system with less than 4Gb RAM total.

nrow(df)
[1] 1728000

library(microbenchmark)

microbenchmark( 
  "grep1" = { res <- sapply(ac$ac, function(x) df$id[grep(x, df$description)]) },
  "grep2" = { res <- df$id[ unlist( sapply( ac$ac, function(x) grep(x, df$description ) ) ) ] },
  "stringi" = { res <- df$id[ as.logical( rowSums( sapply( ac$ac, function(x) stri_detect_regex( df$description, x ) ) ) ) ] }, times=10 )

Unit: seconds
   expr      min       lq      mean   median        uq       max neval cld
  grep1 96.90757 97.98706 100.13299 99.05837 101.99050 107.04312    10   b
  grep2 97.51382 97.66425 100.00610 99.20753 101.17921 106.86661    10   b
stringi 46.15548 46.65894  48.68073 47.29635  50.15713  53.50351    10  a

Memory footprint during microbenchmark:
Path: /Library/Frameworks/R.framework/Versions/4.0/Resources/bin/exec/R
Physical footprint: 638.3M
Physical footprint (peak): 1.8G

like image 32
Andre Wildberg Avatar answered Oct 22 '22 23:10

Andre Wildberg


You can use regex_inner_join from package fuzzyjoin

> library(fuzzyjoin)

> regex_inner_join(df, ac, by = c(description = "ac"))
   month        id
1 202110 100559687
2 201703 100558946
3 201502 100547618

                                                              description
1 residential local telephone service local with more san francisco ca flat rate with eas package plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95
2               residential all distance telephone service  unlimited voice only pittsburgh pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking
3                                                                                                                                                                                      residential public switched toll interstate manhattan ks ks plan area residence switched toll base period average revenue per minute 0 18 minute online
                ac
1 san francisco ca
2    pittsburgh pa
3     manhattan ks
like image 34
ThomasIsCoding Avatar answered Oct 23 '22 00:10

ThomasIsCoding


Checking using a regular expression and non-expensive functions should be fast:

First, we generate the pattern to be checked: ac_regex <- paste(ac$ac, collapse = "|").

There are several ways to detect matches in description and subset. Here are three:

# 1 grep()
df[grep(ac_regex, df$description), ]["id"],
# 2 stringi::stri_detect_*()
df[stri_detect_regex(df$description, ac_regex), ]["id"],
# 3 stringr::str_detect() + tidy subsetting
df %>% filter(description %>% str_detect(ac_regex)) %>% select(id),

All three return the desired subset of df:

         id
1 100559687
2 100558946
3 100547618

(You need the packages tidyverse and stringi for options 2 and 3.)

Let's benchmark (using package bench):

bench::mark(
  base_grep = df[grep(ac_regex, df$description), ]["id"],
  base_stringi = df[stringi::stri_detect_regex(df$description, ac_regex), ]["id"],
  tidy = df %>% filter(description %>% str_detect(ac_regex)) %>% select(id),
  check = F
)
  expression     median 
  <bch:expr>   <bch:tm>   
1 base_grep    146.61µs      
2 base_stringi  119.6µs     
3 tidy           1.99ms   

I'd go with stringi!

like image 2
Martin C. Arnold Avatar answered Oct 22 '22 23:10

Martin C. Arnold