
Find best match for multiple substrings across multiple candidates

Tags: substring, r

I have the following sample data:

targets <- c("der", "das")
candidates <- c("sdassder", "sderf", "fongs")

Desired Output:

I would like to get sdassder as the output, since it contains the most matches for targets (as substrings).

What I tried:

x <- sapply(targets, function(target) sapply(candidates, grep, pattern = target)) > 0
which.max(rowSums(x))

Goal:

As you can see, I found some dirty code that technically yields the result, but I don't feel it's best practice. I hope this question fits here; otherwise I will move it to Code Review.

I tried mapply, do.call, and outer, but didn't manage to find better code.

Edit:

Adding another option myself, after seeing the current answers.

Using pipes:

library(magrittr)  # provides %>%
sapply(targets, grepl, candidates) %>% rowSums %>% which.max %>% candidates[.]
Tlatwork asked Dec 19 '19 20:12


3 Answers

You can simplify it a little, I think.

matches <- sapply(targets, grepl, candidates)
matches
#        der   das
# [1,]  TRUE  TRUE
# [2,]  TRUE FALSE
# [3,] FALSE FALSE

And find the number of matches using rowSums:

rowSums(matches)
# [1] 2 1 0
candidates[ which.max(rowSums(matches)) ]
# [1] "sdassder"

(Note that which.max returns only the first maximum, so this last step does not tell you about ties.)
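If ties matter, one possible approach is to compare every score against the maximum instead of calling which.max. The sketch below is only an illustration: the extra candidate "dasder" is a hypothetical value added to force a tie, and the variable names scores, first_best and all_best are not part of the original answer.

```r
targets <- c("der", "das")
# "dasder" is a hypothetical extra candidate that ties with "sdassder"
candidates <- c("sdassder", "sderf", "fongs", "dasder")

# one row per candidate, one column per target
matches <- sapply(targets, grepl, candidates)
scores <- rowSums(matches)

# which.max() keeps only the first index that attains the maximum
first_best <- candidates[which.max(scores)]

# comparing against max() keeps every tied candidate
all_best <- candidates[scores == max(scores)]
all_best
```

Here first_best holds a single candidate, while all_best holds every candidate that reaches the top score.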

If you want to see the individual matches per-candidate, you can always apply the names manually, though this is only an aesthetic thing, adding very little to the work itself.

rownames(matches) <- candidates
matches
#            der   das
# sdassder  TRUE  TRUE
# sderf     TRUE FALSE
# fongs    FALSE FALSE
rowSums(matches)
# sdassder    sderf    fongs 
#        2        1        0 
which.max(rowSums(matches))
# sdassder 
#        1        <------ this "1" indicates the index within the rowSums vector
names(which.max(rowSums(matches)))
# [1] "sdassder"
r2evans answered Nov 15 '22 05:11


One stringr option could be:

library(stringr)
candidates[which.max(rowSums(outer(candidates, targets, str_detect)))]
# [1] "sdassder"
tmfmnk answered Nov 15 '22 06:11


We could paste the targets together and create a pattern to match.

library(stringr)
str_c(targets, collapse = "|")
#[1] "der|das"

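Use it in str_count to count how many times the pattern matches in each candidate. Note that this counts occurrences, so a target that appears twice in a candidate is counted twice.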

str_count(candidates, str_c(targets, collapse = "|"))
#[1] 2 1 0

Get the index of the maximum value and use it to subset the original candidates.

candidates[which.max(str_count(candidates, str_c(targets, collapse = "|")))]
#[1] "sdassder"
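Counting occurrences and counting distinct targets can give different rankings. The following base R sketch illustrates the difference; it uses gregexpr with regmatches as a stand-in for str_count (an assumption made so the example needs no packages), and "derder" is a hypothetical candidate added for the illustration.

```r
targets <- c("der", "das")
# "derder" is a hypothetical candidate: it repeats one target and lacks the other
candidates <- c("sdassder", "sderf", "fongs", "derder")

pattern <- paste(targets, collapse = "|")

# total number of non-overlapping matches of the combined pattern
total_hits <- lengths(regmatches(candidates, gregexpr(pattern, candidates)))

# number of distinct targets found in each candidate
distinct_hits <- rowSums(sapply(targets, grepl, candidates))

total_hits
distinct_hits
```

With the combined pattern, "derder" scores as high as "sdassder" even though it contains only one of the two targets. Which count is appropriate depends on whether repeated occurrences should add to the score.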
Ronak Shah answered Nov 15 '22 04:11