Add column with percentage of matching words in two different columns (by row) in R

I have a tbl_df and want to see the percentage of matching words between two strings.

The data looks like this:

# A tibble: 3 x 2
       X                 Y
     <chr>             <chr>
1 "mary smith"      "mary smith"
2 "mary smith"      "john smith"
3 "mike williams"   "jack johnson"

Desired output (% by row, in any order):

# A tibble: 3 x 3
       X               Y           Z 
     <chr>           <chr>        <dbl>
1 "mary smith"    "mary smith"     1.0 
2 "mary smith"    "john smith"     0.50 
3 "mike williams" "jack johnson"   0.0
bkt619 asked Aug 09 '18 13:08


3 Answers

A base R option is to split both columns on spaces, count the words common to each pair (intersect), and divide that count by the number of words in X

df1$Z <- mapply(function(x, y)  length(intersect(x, y))/length(x), 
            strsplit(df1$X, " "), strsplit(df1$Y, " "))
df1$Z
#[1] 1.0 0.5 0.0
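
The base R logic above can be wrapped in a small reusable function. This is only a sketch under stated assumptions: the name word_match_share is hypothetical, and lower-casing the input, splitting on any run of whitespace, and ignoring repeated words are my own choices about what counts as a "matching word", not part of the original approach.

```r
# Hypothetical helper (name and normalisation choices are assumptions).
# Returns, per row, the share of distinct words in x that also occur in y.
word_match_share <- function(x, y) {
  split_words <- function(s) {
    # lower-case, trim, split on runs of whitespace, keep distinct words
    lapply(strsplit(trimws(tolower(as.character(s))), "\\s+"), unique)
  }
  mapply(
    function(a, b) {
      # an empty string has no words, so the share is undefined
      if (length(a) == 0) NA_real_ else length(intersect(a, b)) / length(a)
    },
    split_words(x), split_words(y),
    USE.NAMES = FALSE
  )
}

df1 <- data.frame(
  X = c("mary smith", "mary smith", "mike williams"),
  Y = c("mary smith", "john smith", "jack johnson"),
  stringsAsFactors = FALSE
)
df1$Z <- word_match_share(df1$X, df1$Y)
df1$Z
# [1] 1.0 0.5 0.0
```

The denominator is the number of words in X, as in the code above; if the two strings can differ in length, dividing by the larger word count may be a better fit.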

Or in the tidyverse, we can use map2_dbl (which returns a numeric vector rather than a list column) and apply the same logic

library(tidyverse)
df1 %>% 
  mutate(Z = map2_dbl(strsplit(X, " "), strsplit(Y, " "), ~ 
                       length(intersect(.x, .y))/length(.x)))
#              X            Y   Z
#1    mary smith   mary smith 1.0
#2    mary smith   john smith 0.5
#3 mike williams jack johnson 0.0

data

df1 <- structure(list(X = c("mary smith", "mary smith", "mike williams"
), Y = c("mary smith", "john smith", "jack johnson")), .Names = c("X", 
"Y"), class = "data.frame", row.names = c("1", "2", "3"))
akrun answered Nov 15 '22 06:11


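Here is a tidyverse option using stringr::str_split. Note that sum(.x == .y) compares words position by position, so a word only counts as a match when it sits at the same position in both strings.

library(dplyr)
library(purrr)    # map2_dbl()
library(stringr)
df %>%
    mutate(Z = map2_dbl(str_split(X, " "), str_split(Y, " "), ~sum(.x == .y) / length(.x)))
#              X            Y   Z
#1    mary smith   mary smith 1.0
#2    mary smith   john smith 0.5
#3 mike williams jack johnson 0.0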

or using stringi::stri_extract_all_words

library(purrr)    # map2_dbl()
library(stringi)
df %>%
    mutate(Z = map2_dbl(stri_extract_all_words(X), stri_extract_all_words(Y), ~sum(.x == .y) / length(.x)))
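
Since the question asks for matches "in any order", it may help to see how the position-wise comparison used above differs from a set-based one. The following base R sketch uses a made-up pair of strings (not from the sample data) in which the words are swapped:

```r
# Illustrative input (an assumption, not part of the sample data)
x_words <- strsplit("mary smith", " ")[[1]]
y_words <- strsplit("smith mary", " ")[[1]]

# Position-wise: compares "mary" with "smith" and "smith" with "mary"
positional <- sum(x_words == y_words) / length(x_words)

# Order-insensitive: share of words in x that occur anywhere in y
any_order <- mean(x_words %in% y_words)

positional
# [1] 0
any_order
# [1] 1
```

If order should not matter, replacing sum(.x == .y) / length(.x) with mean(.x %in% .y) in the pipelines above is one way to get that behaviour.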

Sample data

df <- read.table(text =
    '       X                 Y
 "mary smith"      "mary smith"
 "mary smith"      "john smith"
 "mike williams"   "jack johnson"', header = TRUE, stringsAsFactors = FALSE)
Maurits Evers answered Nov 15 '22 07:11


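Try using stringsim() from the stringdist package. It scores character-level similarity rather than the share of matching words, so the values differ from the desired output:

library(dplyr)       # tibble(), mutate(), %>%
library(stringdist)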

tbl <- tibble(x = c("mary smith", "mary smith", "mike williams"),
              y = c("mary smith", "john smith", "jack johnson"))

# lv = Levenshtein distance
tbl %>% mutate(z = stringsim(x, y, method = 'lv'))

# jw = Jaro-Winkler (with stringdist's default p = 0 this is the plain Jaro similarity)
tbl %>% mutate(z = stringsim(x, y, method = 'jw'))

## > tbl %>% mutate(z = stringsim(x, y, method ='lv'))
## # A tibble: 3 x 3
##  x             y                 z
##  <chr>         <chr>         <dbl>
## 1 mary smith    mary smith   1.00  
## 2 mary smith    john smith   0.600 
## 3 mike williams jack johnson 0.0769

## > tbl %>% mutate(z = stringsim(x, y, method ='jw'))
## # A tibble: 3 x 3
##   x             y                z
##  <chr>         <chr>        <dbl>
## 1 mary smith    mary smith   1.00 
## 2 mary smith    john smith   0.733
## 3 mike williams jack johnson 0.494
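
To see where a value like 0.600 in the Levenshtein output comes from, the sketch below reproduces it with base R's adist(). It assumes the similarity is defined as 1 minus the edit distance divided by the length of the longer string, which is my understanding of how stringsim() normalises this method; treat that as an assumption rather than a statement about the package internals.

```r
x <- c("mary smith", "mary smith")
y <- c("mary smith", "john smith")

# Levenshtein edit distance for each row (adist() is in utils, attached by default)
d <- mapply(function(a, b) drop(adist(a, b)), x, y, USE.NAMES = FALSE)

# Normalise by the longer string so that identical strings score 1
sim <- 1 - d / pmax(nchar(x), nchar(y))
sim
# [1] 1.0 0.6
```

"mary" and "john" share no characters, so four substitutions are needed over ten characters, giving 1 - 4/10 = 0.6 even though only one of the two words matches.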
ryanhnkim answered Nov 15 '22 07:11