
Count matches between two strings

Tags:

r

I have two data frames:

df.1 <- data.frame(loc = c('A','B','C','C'), person = c(1,2,3,4), str = c("door / window / table", "window / table / toilet / vase ", "TV / remote / phone / window", "book / vase / car / chair"))

Thus,

  loc person                             str
1   A      1           door / window / table
2   B      2 window / table / toilet / vase 
3   C      3    TV / remote / phone / window
4   C      4       book / vase / car / chair

And,

df.2 <- data.frame(loc = c('A','B','C'), str = c("book / chair / chair", " table / remote / vase ", "window"))

which gives,

  loc                     str
1   A    book / chair / chair
2   B  table / remote / vase 
3   C                  window

I want to create a variable df.1$percentage that gives, for each row, the proportion of elements in df.1$str that also appear in df.2$str for the same loc:

  loc person                             str percentage
1   A      1           door / window / table       0.00
2   B      2 window / table / toilet / vase        0.50
3   C      3    TV / remote / phone / window       0.25
4   C      4       book / vase / car / chair       0.00

(Person 1 has 0/3 matches, person 2 has 2/4, person 3 has 1/4, and person 4 has 0/4.)

Thanks!

Lucarno asked May 29 '13 22:05




2 Answers

As you might know, data.frame columns can also hold lists (see Create a data.frame where a column is a list). So you can split your str into lists of words:

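# trim outer whitespace and split on "/" with optional surrounding spaces,
# so that " table" and "vase " still match "table" and "vase"
df.1 <- transform(df.1, words.1 = I(strsplit(trimws(as.character(str)), "\\s*/\\s*")))
df.2 <- transform(df.2, words.2 = I(strsplit(trimws(as.character(str)), "\\s*/\\s*")))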
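
The sample strings contain stray leading and trailing spaces, which affects how the split behaves. As a small illustration (a sketch using row B of df.2; the variable names `s`, `naive` and `clean` are made up for this example, and `trimws()` assumes R 3.2.0 or later), compare a split on the literal " / " with a split that trims first:

```r
# Row B of df.2 from the question, including its stray outer spaces
s <- " table / remote / vase "

# Splitting on the literal " / " keeps the outer spaces on the first and last word
naive <- strsplit(s, " / ", fixed = TRUE)[[1]]

# Trimming first, then splitting on "/" with optional surrounding whitespace
clean <- strsplit(trimws(s), "\\s*/\\s*")[[1]]

print(naive)
print(clean)

# Only the trimmed version contains the bare word "table"
print("table" %in% naive)
print("table" %in% clean)
```

With the untrimmed split, "table" from df.1 would not be counted as a match for " table" from df.2, so row B would likely come out lower than the expected 0.50.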

Then merge your data:

m <- merge(df.1, df.2, by = "loc")

And simply compute the percentage using mapply:

transform(m, percentage = mapply(function(x, y) sum(x%in%y) / length(x),
                                 words.1, words.2))
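
If the goal is to add the column to df.1 directly and keep its original row order, one possible variation is to look up the matching df.2 row with match() instead of merging. This is only a sketch under some assumptions: `split_words` is a hypothetical helper introduced here, each loc appears at most once in df.2, and `trimws()` requires R 3.2.0 or later.

```r
df.1 <- data.frame(
  loc = c("A", "B", "C", "C"),
  person = c(1, 2, 3, 4),
  str = c("door / window / table", "window / table / toilet / vase ",
          "TV / remote / phone / window", "book / vase / car / chair"),
  stringsAsFactors = FALSE
)
df.2 <- data.frame(
  loc = c("A", "B", "C"),
  str = c("book / chair / chair", " table / remote / vase ", "window"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim outer whitespace, then split on "/" with optional spaces
split_words <- function(s) strsplit(trimws(as.character(s)), "\\s*/\\s*")

words.1 <- split_words(df.1$str)
# For each row of df.1, pick the word list of the df.2 row with the same loc
words.2 <- split_words(df.2$str)[match(df.1$loc, df.2$loc)]

# Share of the words in each df.1 row that also occur in the matching df.2 row
df.1$percentage <- mapply(function(x, y) sum(x %in% y) / length(x),
                          words.1, words.2)
print(df.1)
```

On the sample data this should give 0, 0.5, 0.25 and 0 for the four rows, in line with the expected result in the question.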
flodel answered Oct 03 '22 09:10


Someone can probably come up with a smarter solution, but here's a straightforward approach:

library(data.table)
dt1 = data.table(df.1, key = "loc") # set the key to match by loc
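dt2 = data.table(df.2)
setnames(dt2, "str", "str.1") # explicit name, so the join does not depend on how data.table renames clashing columns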

dt1[, percentage := dt1[dt2][, # merge
           # clean up spaces and convert to strings
           `:=`(str = gsub(" ", "", as.character(str)),
                str.1 = gsub(" ", "", as.character(str.1)))][,
           # calculate the percentage for each row
           lapply(1:.N, function(i) {
                tmp = strsplit(str, "/")[[i]];
                sum(tmp %in% strsplit(str.1, "/")[[i]])/length(tmp)
           })
   ]]

dt1
#   loc person                             str percentage
#1:   A      1           door / window / table          0
#2:   B      2 window / table / toilet / vase         0.5
#3:   C      3    TV / remote / phone / window       0.25
#4:   C      4       book / vase / car / chair          0
eddi answered Oct 03 '22 11:10