I have two data frames:
df.1 <- data.frame(loc = c('A','B','C','C'), person = c(1,2,3,4), str = c("door / window / table", "window / table / toilet / vase ", "TV / remote / phone / window", "book / vase / car / chair"))
Thus,
  loc person str
1 A   1      door / window / table
2 B   2      window / table / toilet / vase
3 C   3      TV / remote / phone / window
4 C   4      book / vase / car / chair
And,
df.2 <- data.frame(loc = c('A','B','C'), str = c("book / chair / chair", " table / remote / vase ", "window"))
which gives,
  loc str
1 A   book / chair / chair
2 B   table / remote / vase
3 C   window
I want to create a variable df.1$percentage that calculates the percentage of elements in df.1$str that are also in df.2$str, matched by loc, i.e.:
  loc person str                             percentage
1 A   1      door / window / table           0.00
2 B   2      window / table / toilet / vase  0.50
3 C   3      TV / remote / phone / window    0.25
4 C   4      book / vase / car / chair       0.00
(1 has 0/3 matches, 2 has 2/4, 3 has 1/4, and 4 has 0/4)
Thanks!
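As you might know, data.frame columns can also hold lists (see Create a data.frame where a column is a list). So you can split your str into lists of words. Some of your strings carry leading or trailing blanks (e.g. "vase " and " table"), so trim them and split on the slash with optional surrounding whitespace; otherwise those items will not match:
df.1 <- transform(df.1, words.1 = I(strsplit(trimws(as.character(str)), "\\s*/\\s*")))
df.2 <- transform(df.2, words.2 = I(strsplit(trimws(as.character(str)), "\\s*/\\s*")))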
Then merge your data:
m <- merge(df.1, df.2, by = "loc")
And simply compute the percentage using mapply:
transform(m, percentage = mapply(function(x, y) sum(x %in% y) / length(x),
                                 words.1, words.2))
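For reference, here is a self-contained base R sketch of the same idea that skips the merge and instead looks up each row's df.2 entry with match(), which keeps df.1 in its original row order. The helper name split_items and the whitespace-tolerant split pattern are my own assumptions for illustration, not something from the question; it also assumes each loc appears at most once in df.2.

```r
# Sample data from the question
df.1 <- data.frame(loc = c('A','B','C','C'), person = c(1,2,3,4),
                   str = c("door / window / table",
                           "window / table / toilet / vase ",
                           "TV / remote / phone / window",
                           "book / vase / car / chair"))
df.2 <- data.frame(loc = c('A','B','C'),
                   str = c("book / chair / chair",
                           " table / remote / vase ",
                           "window"))

# Hypothetical helper: trim outer blanks, then split on "/" with optional
# whitespace on either side, so "vase " and " table" become "vase" and "table"
split_items <- function(s) strsplit(trimws(as.character(s)), "\\s*/\\s*")

words.1 <- split_items(df.1$str)
# For every row of df.1, pick the word list of the df.2 row with the same loc
words.2 <- split_items(df.2$str)[match(df.1$loc, df.2$loc)]

# Share of each row's items that also occur in the matching df.2 entry
df.1$percentage <- mapply(function(x, y) mean(x %in% y), words.1, words.2)

print(df.1$percentage)
# [1] 0.00 0.50 0.25 0.00
```

Using mean() on the logical vector is just shorthand for sum(x %in% y) / length(x). A loc that is missing from df.2 would end up with a percentage of 0 here, which may or may not be what you want.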
Someone can probably come up with a smarter solution, but here's a straightforward approach:
library(data.table)
dt1 = data.table(df.1, key = "loc") # set the key to match by loc
dt2 = data.table(df.2)
# note: str.1 is dt2's str column after the join; newer data.table
# versions may name it i.str instead, so adjust if needed
dt1[, percentage := dt1[dt2][, # merge
        # clean up spaces and convert to strings
        `:=`(str   = gsub(" ", "", as.character(str)),
             str.1 = gsub(" ", "", as.character(str.1)))][,
        # calculate the percentage for each row
        lapply(1:.N, function(i) {
          tmp = strsplit(str, "/")[[i]];
          sum(tmp %in% strsplit(str.1, "/")[[i]])/length(tmp)
        })
      ]]
dt1
# loc person str percentage
#1: A 1 door / window / table 0
#2: B 2 window / table / toilet / vase 0.5
#3: C 3 TV / remote / phone / window 0.25
#4: C 4 book / vase / car / chair 0