I have a keyword (e.g. 'green') and some text ("I do not like them Sam I Am!").
I'd like to see how many of the characters in the keyword ('g', 'r', 'e', 'e', 'n') occur in the text (in any order).
In this example the answer is 3 - the text doesn't have a G or R but has two Es and an N.
The complication is that once a character in the text has been matched with a character in the keyword, it can't be reused to match a different character in the keyword.
For example, if my keyword was 'greeen', the number of "matching characters" is still 3 (one N and two Es) because there are only two Es in the text, not 3 (to match the third E in the keyword).
How can I write this in R? This is tickling something at the edge of my memory - I feel like it's a common problem, just worded differently (sort of like sampling without replacement, but "matching without replacement"?).
E.g.
keyword <- strsplit('greeen', '')[[1]]
text <- strsplit('idonotlikethemsamiam', '')[[1]]
# how many characters in keyword have matches in text,
# with no replacement?
# Attempt 1: sum(keyword %in% text)
# PROBLEM: returns 4 (all three Es match, but only two in text)
More examples of expected input/outputs (keyword, text, expected output):
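The function pmatch() works well for this. With its default duplicates.ok = FALSE, each element of the table can be matched at most once, so a character in the text is used up as soon as it is matched. Keyword characters with no remaining match come back as NA, so the answer is the number of non-NA results. It is tempting to use length() here, but length() has no na.rm option, so sum(!is.na()) is used instead.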
keyword <- unlist(strsplit('greeen', ''))
text <- unlist(strsplit('idonotlikethemsamiam', ''))
sum(!is.na(pmatch(keyword, text)))
# [1] 3
keyword2 <- unlist(strsplit("red", ''))
sum(!is.na(pmatch(keyword2, text)))
# [1] 2
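For comparison, the same "matching without replacement" count can be sketched as a multiset intersection: count each distinct keyword character in both strings and sum the smaller of the two counts. This is an alternative to the pmatch() approach above, not part of it. The function name count_matches is made up for illustration, and the sketch assumes both inputs are single strings that are already lower-cased and stripped of spaces and punctuation.

```r
# Hypothetical helper (name chosen for illustration only).
# Counts keyword characters that can be matched in text,
# using each text character at most once.
count_matches <- function(keyword, text) {
  k <- strsplit(keyword, "")[[1]]
  t <- strsplit(text, "")[[1]]
  chars <- unique(k)  # only keyword characters matter

  # Occurrences of each distinct keyword character in keyword and in text.
  # match() gives NA for text characters not in the keyword;
  # tabulate() ignores NA.
  k_counts <- tabulate(match(k, chars), nbins = length(chars))
  t_counts <- tabulate(match(t, chars), nbins = length(chars))

  # A character can be matched only as often as it occurs in both.
  sum(pmin(k_counts, t_counts))
}

count_matches("greeen", "idonotlikethemsamiam")
# [1] 3
count_matches("red", "idonotlikethemsamiam")
# [1] 2
```

Working from counts rather than positions makes the "no replacement" rule explicit: for 'greeen' the keyword has three Es and the text has two, so E contributes min(3, 2) = 2.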