I have a tbl_df and want to see the percentage of matching words between two strings.
The data looks like this:
# A tibble: 3 x 2
  X               Y
  <chr>           <chr>
1 "mary smith"    "mary smith"
2 "mary smith"    "john smith"
3 "mike williams" "jack johnson"
Desired output (% by row, in any order):
# A tibble: 3 x 3
  X               Y                  Z
  <chr>           <chr>          <dbl>
1 "mary smith"    "mary smith"    1.0
2 "mary smith"    "john smith"    0.50
3 "mike williams" "jack johnson"  0.0
A base R option is to split both columns on spaces, count the words they have in common with intersect(), and divide that count by the number of words in X:
df1$Z <- mapply(function(x, y) length(intersect(x, y))/length(x),
strsplit(df1$X, " "), strsplit(df1$Y, " "))
df1$Z
#[1] 1.0 0.5 0.0
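The same idea can be wrapped in a small reusable function. This is only a sketch: the name word_overlap is hypothetical, and the lower-casing, whitespace trimming, de-duplication of words, and NA result for empty strings are assumptions added here, not part of the original answer.

```r
# Hypothetical helper: share of the distinct words in `x` that also occur in `y`.
word_overlap <- function(x, y) {
  split_words <- function(s) {
    # lower-case, trim, and split on runs of whitespace
    strsplit(trimws(tolower(s)), "\\s+")
  }
  mapply(function(a, b) {
    a <- unique(a)                           # count repeated words once (assumption)
    if (length(a) == 0) return(NA_real_)     # empty string: ratio is undefined
    length(intersect(a, b)) / length(a)
  }, split_words(x), split_words(y), USE.NAMES = FALSE)
}

df1 <- data.frame(
  X = c("mary smith", "mary smith", "mike williams"),
  Y = c("mary smith", "john smith", "jack johnson"),
  stringsAsFactors = FALSE
)
df1$Z <- word_overlap(df1$X, df1$Y)
df1$Z
#[1] 1.0 0.5 0.0
```

Because the denominator is the word count of X, the measure is not symmetric: swapping X and Y can change the result when the two strings have different numbers of words.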
Or, in the tidyverse, we can apply the same logic with map2_dbl (plain map2 would return a list column rather than the <dbl> column asked for):
library(tidyverse)
df1 %>%
  mutate(Z = map2_dbl(strsplit(X, " "), strsplit(Y, " "), ~
      length(intersect(.x, .y)) / length(.x)))
# X Y Z
#1 mary smith mary smith 1
#2 mary smith john smith 0.5
#3 mike williams jack johnson 0
df1 <- structure(list(X = c("mary smith", "mary smith", "mike williams"
), Y = c("mary smith", "john smith", "jack johnson")), .Names = c("X",
"Y"), class = "data.frame", row.names = c("1", "2", "3"))
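Here is a tidyverse option using stringr::str_split:
library(dplyr)
library(purrr)   # provides map2_dbl
library(stringr)
# Note: .x == .y compares words position by position, so word order matters
df %>%
  mutate(Z = map2_dbl(str_split(X, " "), str_split(Y, " "), ~sum(.x == .y) / length(.x)))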
# X Y Z
#1 mary smith mary smith 1
#2 mary smith john smith 0.5
#3 mike williams jack johnson 0
Or using stringi::stri_extract_all_words:
library(stringi)
df %>%
  mutate(Z = map2_dbl(stri_extract_all_words(X), stri_extract_all_words(Y), ~sum(.x == .y) / length(.x)))
df <- read.table(text =
'              X              Y
   "mary smith"   "mary smith"
   "mary smith"   "john smith"
"mike williams" "jack johnson"', header = TRUE, stringsAsFactors = FALSE)
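Since the question asks for matches "in any order", it may be worth seeing how the position-wise comparison (sum(.x == .y)) used in this answer differs from a set-based one (intersect). The pair of strings below is made up for illustration and is not in the original data:

```r
# Illustrative pair: same words, different order
x <- strsplit("john smith", " ")[[1]]
y <- strsplit("smith john", " ")[[1]]

positional <- sum(x == y) / length(x)               # compares word 1 with word 1, word 2 with word 2
set_based  <- length(intersect(x, y)) / length(x)   # ignores word order

positional
#[1] 0
set_based
#[1] 1
```

On the three rows in the question both approaches give the same result, because the matching words happen to sit in the same positions. Note also that == recycles the shorter vector (with a warning if the lengths are not multiples of each other) when the two strings have different word counts.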
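Try stringsim() from the stringdist package. Note that it measures character-level similarity between the strings rather than the share of matching words:
library(dplyr)   # tibble(), mutate() and %>%
library(stringdist)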
tbl <- tibble(x = c("mary smith", "mary smith", "mike williams"),
y = c("mary smith", "john smith", "jack johnson"))
# lv = Levenshtein distance
tbl %>% mutate(z = stringsim(x, y, method = 'lv'))
# jw = Jaro-Winkler distance
tbl %>% mutate(z = stringsim(x, y, method = 'jw'))
## > tbl %>% mutate(z = stringsim(x, y, method ='lv'))
## # A tibble: 3 x 3
## x y z
## <chr> <chr> <dbl>
## 1 mary smith mary smith 1.00
## 2 mary smith john smith 0.600
## 3 mike williams jack johnson 0.0769
## > tbl %>% mutate(z = stringsim(x, y, method ='jw'))
## # A tibble: 3 x 3
## x y z
## <chr> <chr> <dbl>
## 1 mary smith mary smith 1.00
## 2 mary smith john smith 0.733
## 3 mike williams jack johnson 0.494
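To see where the 'lv' scores come from, here is a rough base R sketch. It assumes the similarity is one minus the Levenshtein distance divided by the length of the longer string, which is consistent with the output shown above; it uses utils::adist instead of stringdist, so treat it as an illustration rather than the package's actual implementation.

```r
x <- c("mary smith", "mary smith", "mike williams")
y <- c("mary smith", "john smith", "jack johnson")

# Pairwise Levenshtein distance (adist returns a 1 x 1 matrix per pair)
d <- mapply(function(a, b) drop(adist(a, b)), x, y, USE.NAMES = FALSE)

# Assumed scaling: divide by the longer string's length, subtract from 1
sim <- 1 - d / pmax(nchar(x), nchar(y))
round(sim, 4)
#[1] 1.0000 0.6000 0.0769
```

Row 2 scores 0.6 because turning "mary" into "john" takes four single-character substitutions out of ten characters, even though exactly half of the words match. That is why this score differs from the 0.50 in the desired output.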