Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dealing with wrong spelling when matching text strings in R

I'm collecting survey data (using open data kit), and my field team, bless their hearts, sometimes get a bit creative with the spelling of people's names. So I have a "correct" respondent name, as well as an age variable for some of the records that is linked to a "family member name" variable. There are many family members with different ages. I want respondent age.

Here is some fake data that illustrates my problem:

#the respondent
    r = data.frame(name = c("Barack Obama", "George Bush", "Hillary Clinton"))
#a male member
    m = data.frame(name = c("Barack Obama","George", "Wulliam Clenton"), age = c(55,59,70)); m$name=as.character(m$name)
#a female member
    f = data.frame(name = c("Michelle O","Laura Busch", "Hillary Rodham Clinton"), age = c(54,58,69)); f$name=as.character(f$name)
#if the responsent is the the given member, record their age.  if not, NA
    a = cbind(
        ifelse(r$name==m$name,m$age,NA)
        ,ifelse(r$name==f$name,f$age,NA)
        )
    #make a function for plyr that gives me the age of the matched respondent
    f = function(row){
        d = row[is.na(row)==0]
        ifelse(length(d)==0,NA,d)
        }
    require(plyr)
    b = aaply(a,.margins=1,.fun=f)
    data.frame(names=r$name,age=b)
                names age
    1    Barack Obama  55
    2     George Bush  NA
    3 Hillary Clinton  NA

    what.I.would.like = data.frame(names=c("Barack Obama", "George Bush", "Hillary Clinton"),age = c(55,59,70))
    1> what.I.would.like
                names age
    1    Barack Obama  55
    2     George Bush  59
    3 Hillary Clinton  70

in my real data, I've got hundreds of people and up to 13 family members. I've since changed the survey to record respondent age separately, but I've got a mess of data to clean.

like image 200
generic_user Avatar asked May 30 '13 13:05

generic_user


2 Answers

I recommend you use Jaro-Winkler distance, a string similarity metric developed to solve this exact problem in US census data. It is more sophisticated than levenshtein distance and designed specifically for working with names. You can find an R implementation in the RecordLinkage package. You'll need to set a cut-off threshhold (e.g. 0.8) for how similar two strings must be.

install.packages('RecordLinkage','RSQLite')
require(RecordLinkage)

jarowinkler('William Clinton', "Willam Clntn")
# 0.96
jarowinkler('William Clinton', "Wuliam Clinton")
# 0.8462637
jarowinkler('William Clinton', "Hilary Clinton")
# 0.7790765

I'd recommend setting a reasonably high threshhold (maybe 0.9) for automated matching and then sending records below the high threshhold but above a secondary lower thresshhold (maybe 0.7) to human review. You should play with these numbers and see what works for you. These values will determine your sensitivity/specificity trade-off.

like image 92
David Marx Avatar answered Nov 19 '22 00:11

David Marx


The spelling problems are commonly dealt with by using some variant of the soundex algorithm. There is an R implementation in RecordLinkage package. Then you need to compare not the strings themselves but their "phonetic codes":

> soundex('Clenton') == soundex('Clinton')
[1] TRUE

UPDATE: There is also another way of determining if two words are "close" to each other - which is a "distance" is some sense between the words. One standard measure of the distances is the minimum amount of one-letter replacements, deletions and insertions needed to transform the first word into the second one. It is called Levenshtein distance. RecordLinkage as well as vwr package have the appropriate functions:

> levenshteinDist('Clinton', 'Clenton')
[1] 1

> vwr::levenshtein.distance('Clinton', 'Clenton')
Clenton 
  1 

Then you can use the distances and consider the words "close" enough if the distance does not exceed some threshold.

UPDATE: soundexis also available in phonics package.

like image 20
Alex Vorobiev Avatar answered Nov 19 '22 00:11

Alex Vorobiev