
R - count matches between characters of one string and another, no replacement

Tags:

r

I have a keyword (e.g. 'green') and some text ("I do not like them Sam I Am!").

I'd like to see how many of the characters in the keyword ('g', 'r', 'e', 'e', 'n') occur in the text (in any order).

In this example the answer is 3 - the text doesn't have a G or R but has two Es and an N.

The catch is that once a character in the text has been matched with a character in the keyword, it can't be used to match another character in the keyword.

For example, if my keyword was 'greeen', the number of "matching characters" is still 3 (one N and two Es) because there are only two Es in the text, not 3 (to match the third E in the keyword).

How can I write this in R? This is just ticking something at the edge of my memory - I feel like it's a common problem but just worded differently (sort of like sampling with no replacement, but "matches with no replacement"?).

E.g.

keyword <- strsplit('greeen', '')[[1]]
text <- strsplit('idonotlikethemsamiam', '')[[1]]
# how many characters in keyword have matches in text,
# with no replacement?
# Attempt 1: sum(keyword %in% text)
# PROBLEM: returns 4 (all three Es match, but only two in text)

More examples of expected input/outputs (keyword, text, expected output):

  • 'green', 'idonotlikethemsamiam', 3 (E, E, N)
  • 'greeen', 'idonotlikethemsamiam', 3 (E, E, N)
  • 'red', 'idonotlikethemsamiam', 2 (E and D)
mathematical.coffee asked Feb 18 '13 01:02


1 Answer

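The function pmatch() is great for this. With its default duplicates.ok = FALSE, each element of the table (here, each character of the text) can be matched at most once, which is exactly the "no replacement" behaviour you want; keyword characters that are left without a match come back as NA. It would be instinctive to use length() on the result, but length() has no na.rm option, so count the non-NA entries with sum(!is.na()) instead.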

keyword <- unlist(strsplit('greeen', ''))
text <- unlist(strsplit('idonotlikethemsamiam', ''))

sum(!is.na(pmatch(keyword, text)))

# [1] 3

keyword2 <- unlist(strsplit("red", ''))
sum(!is.na(pmatch(keyword2, text)))

# [1] 2
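
As a cross-check on the pmatch() result, here is a minimal sketch of a different way to get the same count: tabulate each character in both strings and, for every character they share, take the smaller of the two counts. This swaps pmatch() for table() and pmin(); the function name count_matches is made up for illustration and is not from the question or from any package.

```r
# Hypothetical helper: counts keyword characters that can be paired
# one-to-one with characters of the text (no text character reused).
count_matches <- function(keyword, text) {
  k <- table(strsplit(keyword, "")[[1]])  # per-character counts in keyword
  t <- table(strsplit(text, "")[[1]])     # per-character counts in text
  shared <- intersect(names(k), names(t)) # characters present in both
  # Each shared character contributes the smaller of its two counts
  sum(pmin(as.integer(k[shared]), as.integer(t[shared])))
}

count_matches("greeen", "idonotlikethemsamiam")
count_matches("red", "idonotlikethemsamiam")
```

On the examples from the question this should agree with the pmatch() version (3 and 2). It works on counts rather than positions, so it only tells you how many characters match, not which positions in the text were used.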
N8TRO answered Nov 12 '22 17:11