
Joining two datasets using fuzzy logic

I’m trying to do a fuzzy logic join in R between two datasets:

  • The first data set has the name of a location and a column called Config.
  • The second data set has the name of a location and two additional attributes (Age and Rank) that need to be summarized before they are joined to the first data set.

I would like to join the two data sets on the name column. However, the name in either data set may have extra leading or trailing characters, or one name may be contained inside a longer one. For example, in the two data sets below, I'd like OPAL to join to OPALAS, and SAUSALITO Y to join to SAUSALITO.

Dataset1:    
     Name           Config
     ALTO D         BB
     CONTRA         ST
     EIGHT A        DD
     OPALAS         BB
     SAUSALITO Y    AA
     SOLANO J       ST

Dataset2:    
    Name       Age     Rank
    ALTO D     50      2
    ALTO D     20      6
    CONTRA     10      10
    CONTRA     15      15
    EIGHTH     18      21
    OPAL       19      4
    SAUSALITO  2       12
    SOLANO     34      43

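Code for summarizing Dataset2:

library(doBy)  # provides summaryBy()
# Mean Age and total Rank per Name; na.rm is passed through to mean() and sum()
Data2a <- summaryBy(Age ~ Name, FUN = c(mean), data = Data2, na.rm = TRUE)
Data2b <- summaryBy(Rank ~ Name, FUN = c(sum), data = Data2, na.rm = TRUE)
# Name the columns explicitly so the result has a Name column to join on
Data2 <- data.frame(Name = Data2a$Name, Age = Data2a$Age.mean, Rank = Data2b$Rank.sum)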
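The same summary can also be sketched without doBy. The snippet below is a base R alternative, not the code used above: it rebuilds Dataset2 from the table in the question, then combines aggregate() and merge(). The name Data2_summary is made up for illustration.

```r
# Dataset2 as shown in the question
Data2 <- data.frame(
  Name = c("ALTO D", "ALTO D", "CONTRA", "CONTRA",
           "EIGHTH", "OPAL", "SAUSALITO", "SOLANO"),
  Age  = c(50, 20, 10, 15, 18, 19, 2, 34),
  Rank = c(2, 6, 10, 15, 21, 4, 12, 43),
  stringsAsFactors = FALSE
)

# Mean Age per Name and total Rank per Name, computed separately
age_mean <- aggregate(Age ~ Name, data = Data2, FUN = mean)
rank_sum <- aggregate(Rank ~ Name, data = Data2, FUN = sum)

# Combine the two summaries into one row per Name (hypothetical object name)
Data2_summary <- merge(age_mean, rank_sum, by = "Name")
print(Data2_summary)
```

Because merge() matches on Name rather than on row position, the two summaries stay aligned even if their row orders differ.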

Desired Outcome:
    Name         Config  Age   Rank
    ALTO D       BB      35    8
    CONTRA       ST      12.5  25
    EIGHT A      DD      18    21
    OPALAS       BB      19    4
    SAUSALITO Y  AA      2     12
    SOLANO J     ST      34    43
asked Jun 04 '17 23:06 by steppermotor



1 Answer

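I was able to join the two datasets using the fuzzyjoin package. stringdist_inner_join keeps every pair of rows whose Name values are within max_dist of each other in string distance (the default max_dist is 2), which covers pairs such as OPALAS/OPAL and SAUSALITO Y/SAUSALITO. Data2 here is the summarized version of Dataset2: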

library(fuzzyjoin)
stringdist_inner_join(Dataset1, Data2,
     by ="Name", distance_col = NULL)
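To see roughly what a distance-based join does with these names, here is a minimal base R sketch. It is not fuzzyjoin's implementation: it uses utils::adist(), which computes a generalized Levenshtein distance, whereas stringdist_inner_join defaults to the "osa" method, and the two measures can differ when adjacent characters are swapped. The threshold of 2 mirrors fuzzyjoin's default max_dist, and the object names d, hits and joined are made up for illustration.

```r
Dataset1 <- data.frame(
  Name   = c("ALTO D", "CONTRA", "EIGHT A", "OPALAS", "SAUSALITO Y", "SOLANO J"),
  Config = c("BB", "ST", "DD", "BB", "AA", "ST"),
  stringsAsFactors = FALSE
)

# Dataset2 after summarizing (mean Age, summed Rank per Name)
Data2 <- data.frame(
  Name = c("ALTO D", "CONTRA", "EIGHTH", "OPAL", "SAUSALITO", "SOLANO"),
  Age  = c(35, 12.5, 18, 19, 2, 34),
  Rank = c(8, 25, 21, 4, 12, 43),
  stringsAsFactors = FALSE
)

# Edit-distance matrix: rows are Dataset1 names, columns are Data2 names
d <- utils::adist(Dataset1$Name, Data2$Name)

# Keep every (row, col) pair whose distance is at most 2
hits <- which(d <= 2, arr.ind = TRUE)

# Build the joined table from the matching index pairs
joined <- data.frame(
  Name   = Dataset1$Name[hits[, "row"]],
  Config = Dataset1$Config[hits[, "row"]],
  Age    = Data2$Age[hits[, "col"]],
  Rank   = Data2$Rank[hits[, "col"]],
  stringsAsFactors = FALSE
)
print(joined)
```

With a fixed threshold, short names can match each other by accident on larger data sets, so it is worth inspecting the distances (for example via distance_col in fuzzyjoin) before trusting the result.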
answered Oct 19 '22 14:10 by steppermotor
