
Matching words from vectors of strings in R

I'm trying to clean up a database by matching a messy list of site names with an approved list.

As an example, the preferred site name might be 'Cotswold Water Park Pit 28' but the site has been entered into the database as: 'Pit 28', '28', 'CWP Pit 28', and 'Cotswold 28'.

The data looks something like this:

approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")

messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")

I'm looking for a way to match the words/numbers (clusters of non-space characters) in each element in messy with the words/numbers in each element in approved. Ideally I'd end up with something like this:

     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
[1,] "Pit 28"                   "Pit 28"                   "Robinswood"   
[2,] "28"                       "CWP Pit 28"               NA             
[3,] "CWP Pit 28"               "14"                       NA             
[4,] "Cotswold 28"              NA                         NA   

The approved elements form the column names, and any elements from messy that contain matching words/numbers appear in the cells of that column. I recognise there will be some false matches. This is fine: I can filter them manually later, and I might exclude common words like 'forest' and 'hill' from the pattern matching.

I've been able to get the result I want with the sample data above by splitting each element in messy using regex. But then I'm dealing with lists of words/numbers from a list of site names, and I've had to use nested loops or sapply to match them to the elements in approved, because functions like grep, grepl and str_detect only allow for one pattern. As the database is big, this takes a long time when I apply it to the whole thing. What I'd really like is a function which does:

match(any word in approved[1], any word in messy[1])

Either giving me a TRUE FALSE output or extracting messy[1] if it matches would be great!

James asked Aug 13 '20 14:08


2 Answers

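Maybe you are looking for adist, which computes approximate string distances:

# Distance from each messy name (as pattern) to each approved name;
# fixed = FALSE allows partial (substring) matches
x <- adist(messy, approved, fixed=FALSE, ignore.case = TRUE)
# The reverse direction (approved as pattern), transposed to the same shape as x
y <- t(adist(approved, messy, fixed=FALSE, ignore.case = TRUE))
# Keep y only where the approved name is a closest match for that messy name
i <- x == apply(x, 1, min)
y[!i]  <- NA
colnames(y) <- approved
# Break ties using the smallest remaining y, then collect the messy names per approved name
i <- apply(y == apply(y, 1, min, na.rm=TRUE), 2, function(i) messy[i & !is.na(i)])
# Pad every column with NA to a common length and bind into a matrix
do.call(cbind, lapply(i, function(x) x[seq_len(max(lengths(i)))]))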
#     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28"                   "14"                       "Robinswood"   
#[2,] "28"                       NA                         NA             
#[3,] "CWP Pit 28"               NA                         NA             
#[4,] "Cotswold 28"              NA                         NA             
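
To see what the code above builds on, it may help to look at the distance matrix on its own. This is only a small sketch on the sample data from the question: d and closest are names introduced here for illustration, and it assumes that taking the first approved name with the smallest distance is acceptable when there are ties.

```r
approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")

# Rows are messy names, columns are approved names.
# fixed = FALSE switches adist() to partial (substring) matching by default.
d <- adist(messy, approved, fixed = FALSE, ignore.case = TRUE)
dimnames(d) <- list(messy, approved)
d

# For each messy name, the approved name with the smallest distance
# (which.min returns the first one when several are tied).
closest <- approved[apply(d, 1, which.min)]
data.frame(messy, closest)
```

Unlike the matrix in the answer, this assigns each messy name to exactly one approved name, so it cannot list a messy name under several columns.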
GKi answered Sep 21 '22 10:09


A base R option would be:

result <- sapply(approved, function(x) grep(gsub('\\s+', '|', x), messy, value = TRUE))
result
#$`Cotswold Water Park Pit 28`
#[1] "Pit 28"      "28"          "CWP Pit 28"  "Cotswold 28"

#$`Cotswold Water Park Pit 14`
#[1] "Pit 28"      "CWP Pit 28"  "Cotswold 28" "14"         

#$`Robinswood Hill`
#[1] "Robinswood"

The logic here is that we replace each run of whitespace in an approved name with a pipe (|) symbol, which turns it into a regex alternation of its words, and return every element of messy that matches any of those words.

To get output in the same format as shown we can do:

sapply(result, `[`, 1:max(lengths(result)))

#     Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28"                   "Pit 28"                   "Robinswood"   
#[2,] "28"                       "CWP Pit 28"               NA             
#[3,] "CWP Pit 28"               "Cotswold 28"              NA             
#[4,] "Cotswold 28"              "14"                       NA   
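
Because grep matches substrings, a pattern word such as 28 would also hit a name like "Pit 128". If that matters, comparing whole words is one alternative. The following is a minimal sketch rather than a drop-in solution: split_words and its ignore argument are hypothetical names introduced here, and it assumes that words are separated by whitespace and that case can be ignored.

```r
approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")

# Hypothetical helper: lower-case each string, split it on whitespace,
# and drop any words listed in `ignore` (e.g. common words like "hill").
split_words <- function(x, ignore = character()) {
  lapply(strsplit(tolower(x), "\\s+"), setdiff, ignore)
}

# Split every name once, up front, instead of inside the matching loop.
approved_words <- split_words(approved)
messy_words <- split_words(messy)

# Logical matrix: TRUE where a messy name shares a whole word with an approved name.
hits <- sapply(approved_words, function(a) {
  vapply(messy_words, function(m) any(m %in% a), logical(1))
})
dimnames(hits) <- list(messy, approved)

# The matching messy names for each approved name, as a named list.
matches <- setNames(
  lapply(seq_along(approved), function(j) messy[hits[, j]]),
  approved
)
matches
```

The hits matrix gives the TRUE/FALSE output asked for in the question, and matches has the same shape as result above, so the same sapply call can pad it into a matrix. Passing something like ignore = c("forest", "hill") to both split_words calls would leave those words out of the comparison.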
Ronak Shah answered Sep 17 '22 10:09