I'm trying to clean up a database by matching a messy list of site names with an approved list.
As an example, the preferred site name might be 'Cotswold Water Park Pit 28' but the site has been entered into the database as: 'Pit 28', '28', 'CWP Pit 28', and 'Cotswold 28'.
The data looks something like this:
approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")
I'm looking for a way to match the words/numbers (clusters of non-space characters) in each element in messy with the words/numbers in each element in approved. Ideally I'd end up with something like this:
Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
[1,] "Pit 28" "Pit 28" "Robinswood"
[2,] "28" "CWP Pit 28" NA
[3,] "CWP Pit 28" "14" NA
[4,] "Cotswold 28" NA NA
The approved elements form the column names, and any elements from messy which contain matching words/numbers appear in the cells of that column. I recognise there will be some false matches. This is fine: I can filter them manually later and might exclude common words like 'forest' and 'hill' from the pattern matching.
I've been able to get the result I want with the above sample data by splitting each element in messy using regex, but then I'm dealing with lists of words/numbers from a list of site names, and I've been having to use nested loops or sapply to match them to the elements in approved because functions like grep, grepl and str_detect only allow for one pattern. As the database is big, this has been taking a long time when I apply it to the whole thing. What I'd really like is a function which does:
match(any word in approved[1], any word in messy[1])
Either giving me a TRUE/FALSE output or extracting messy[1] if it matches would be great!
Maybe you are looking for adist:
x <- adist(messy, approved, fixed=FALSE, ignore.case = TRUE)
y <- t(adist(approved, messy, fixed=FALSE, ignore.case = TRUE))
i <- x == apply(x, 1, min)
y[!i] <- NA
colnames(y) <- approved
i <- apply(y == apply(y, 1, min, na.rm=TRUE), 2, function(i) messy[i & !is.na(i)])
do.call(cbind, lapply(i, function(x) x[seq_len(max(lengths(i)))]))
# Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28" "14" "Robinswood"
#[2,] "28" NA NA
#[3,] "CWP Pit 28" NA NA
#[4,] "Cotswold 28" NA NA
A base R option would be:
result <- sapply(approved, function(x) grep(gsub('\\s+', '|', x), messy, value = TRUE))
result
#$`Cotswold Water Park Pit 28`
#[1] "Pit 28" "28" "CWP Pit 28" "Cotswold 28"
#$`Cotswold Water Park Pit 14`
#[1] "Pit 28" "CWP Pit 28" "Cotswold 28" "14"
#$`Robinswood Hill`
#[1] "Robinswood"
The logic here is that we replace each run of whitespace in approved with a pipe symbol (|), which turns every approved name into a single alternation pattern, and then return each element of messy that matches any of its words.
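To make that step concrete, here is a small sketch of the pattern built for the first approved name and what grepl reports for it. The stricter variant at the end is my own assumption, not part of the answer above: it wraps the alternation in word boundaries so that a number such as "28" is not matched inside a longer token. Note that either version treats the site name as a regular expression, so names containing characters like "(" or "." would need escaping first.

```r
approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")

# Alternation pattern built from the first approved name
pattern <- gsub("\\s+", "|", approved[1])
pattern
# [1] "Cotswold|Water|Park|Pit|28"

# TRUE for every messy entry containing at least one of those words
hits <- grepl(pattern, messy)
hits
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

# Assumed stricter variant: only match whole words/numbers
strict <- paste0("\\b(", pattern, ")\\b")
strict_hits <- grepl(strict, c("28", "128"), perl = TRUE)
strict_hits
# [1]  TRUE FALSE
```

The plain pattern would match "128" as well, because "28" occurs inside it; whether that matters depends on how the site numbers are written in the real data.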
To get output in the same format as shown we can do:
sapply(result, `[`, 1:max(lengths(result)))
# Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28" "Pit 28" "Robinswood"
#[2,] "28" "CWP Pit 28" NA
#[3,] "CWP Pit 28" "Cotswold 28" NA
#[4,] "Cotswold 28" "14" NA
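The question mentions possibly excluding common words such as 'forest' and 'hill'. One way this could be fitted into the same approach is to drop those words before building each pattern. This is only a sketch: stop_words and build_pattern are hypothetical names introduced here, and the stop list itself is a guess at what would be useful.

```r
approved <- c("Cotswold Water Park Pit 28", "Cotswold Water Park Pit 14", "Robinswood Hill")
messy <- c("Pit 28", "28", "CWP Pit 28", "Cotswold 28", "14", "Robinswood")

# Hypothetical list of words considered too common to match on (lower case)
stop_words <- c("hill", "forest")

# Split a name into words, drop the stop words, and join the rest with "|"
build_pattern <- function(name, stop_words) {
  words <- strsplit(name, "\\s+")[[1]]
  words <- words[!tolower(words) %in% stop_words]
  paste(words, collapse = "|")
}

# One pattern per approved name; the names of the vector are the approved names
patterns <- vapply(approved, build_pattern, character(1), stop_words = stop_words)

# For each pattern, keep the messy entries that match it (case-insensitively)
filtered <- lapply(patterns, grep, x = messy, value = TRUE, ignore.case = TRUE)

# Same padded matrix layout as above
sapply(filtered, `[`, seq_len(max(lengths(filtered))))
```

If every word of an approved name is in the stop list, the pattern becomes an empty string, which matches everything, so such names would need separate handling.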