
MetaPhone functions (like SoundEx) and their use in R?

I want to use MetaPhone, Double Metaphone, Caverphone, MetaPhone3, SoundEx and, if anyone has implemented it yet, NameX functions in R, so that I can categorize and summarize like values and minimize data-cleansing operations prior to analysis.

I am fully aware that each algorithm has its own strengths and weaknesses, and I would strongly prefer not to use SoundEx, though it might still work if I cannot find alternatives. As mentioned in this post, Harper matches a whole list of unrelated names under SoundEx but should not under Metaphone, which gives better matching results.

I am not sure which would serve my purposes best while still preserving some flexibility, so I want to try several of them and, before looking at the values, generate a table like the following.

[Image: example table, not preserved]

Table Source Link

Surnames are not the subject of my initial analysis, but I think they are a good example: what I am really trying to do is treat all like-'sounding' words as the same value, with a simple call as the values are evaluated.

Some things I have already looked at:

  • I know that a C package could be written and called with Rcpp, and there are even C solutions for SoundEx on SE, but I have not written an R package before and want to avoid reinventing the wheel if there is a simpler way to do it directly in R, or if a package exists that has the function available.
  • I am aware that the RecordLinkage and now stringdist packages have a SoundEx function, but not any form of MetaPhone function.

So I am specifically looking for an answer on how to run a MetaPhone / Caverphone function in R and get the resulting "value", so that I can group data values by it.

The additional caveat is that I still consider myself pretty new to R, as I am not a daily user of it.

asked Jan 01 '15 00:01 by CRSouser


2 Answers

The algorithm is pretty straightforward but I, too, could not find an existing R package. If you really need to do this work in R, one short-term option is to install the python module metaphone (pip install metaphone) then use the rPython bridge to use it in R:

library(rPython)

python.exec("from metaphone import doublemetaphone")
python.call("doublemetaphone", "architect")
## [1] "ARKTKT" ""

It's not the most elegant solution, but it gets you metaphone operations in R.

Apache Commons has a codec library (Commons Codec) that also implements the Metaphone algorithms:

library(rJava)

.jinit() # need to have commons-codec-1.10.jar in your CLASSPATH

mp <- .jnew("org.apache.commons.codec.language.Metaphone")
.jcall(mp,"S","metaphone", "architect")
## [1] "ARXT"

You can wrap the above .jcall in an R function and use it like any other R function:

metaphone <- function(x) {
  .jcall(mp,"S","metaphone", x)  
}

sapply(c("abridgement", "stupendous"), metaphone)

## abridgement  stupendous 
##      "ABRJ"      "STPN"

The Java interface may be more compatible across platforms, too.

Here's a more complete view of using the Java interface:

library(rJava)

.jinit()

mp <- .jnew("org.apache.commons.codec.language.Metaphone")
dmp <- .jnew("org.apache.commons.codec.language.DoubleMetaphone")

metaphone <- function(x) {
  .jcall(mp,"S","metaphone", x)  
}

double_metaphone <- function(x) {
  .jcall(dmp,"S","doubleMetaphone", x)  
}

words <- c('Catherine', 'Katherine', 'Katarina', 'Johnathan', 
           'Jonathan', 'John', 'Teresa', 'Theresa', 'Smith', 
           'Smyth', 'Jessica', 'Joshua')

data.frame(metaphone=sapply(words, metaphone),
           double=sapply(words, double_metaphone))

##           metaphone double
## Catherine      K0RN   K0RN
## Katherine      K0RN   K0RN
## Katarina       KTRN   KTRN
## Johnathan      JN0N   JN0N
## Jonathan       JN0N   JN0N
## John             JN     JN
## Teresa          TRS    TRS
## Theresa         0RS    0RS
## Smith           SM0    SM0
## Smyth           SM0    SM0
## Jessica         JSK    JSK
## Joshua           JX     JX
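Once every value has a code, the grouping the question asks about needs nothing beyond base R. The sketch below is only illustrative: it hard-codes the Metaphone codes from the table above instead of calling rJava, so it runs without the Commons Codec jar, and the variable names (codes, groups, shared, code_counts) are my own choices.

```r
# Words and their Metaphone codes, copied from the table above
# (hard-coded so the sketch runs without rJava).
words <- c("Catherine", "Katherine", "Katarina", "Johnathan",
           "Jonathan", "John", "Teresa", "Theresa", "Smith",
           "Smyth", "Jessica", "Joshua")
codes <- c("K0RN", "K0RN", "KTRN", "JN0N", "JN0N", "JN",
           "TRS", "0RS", "SM0", "SM0", "JSK", "JX")

# Group the original values by their phonetic code.
groups <- split(words, codes)

# Keep only the codes shared by more than one spelling.
shared <- groups[lengths(groups) > 1]
print(shared)

# Count how many values fall under each code.
code_counts <- table(codes)
```

With live data, the hard-coded codes vector would presumably be replaced by sapply(words, metaphone) from the snippet above.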
answered Sep 24 '22 03:09 by hrbrmstr


There is now an implementation of Double Metaphone in R in the package PGRdup.

install.packages("PGRdup")  # the package name must be a quoted string
library(PGRdup)
words <- c('Catherine', 'Katherine', 'Katarina', 'Johnathan', 
           'Jonathan', 'John', 'Teresa', 'Theresa', 'Smith', 
           'Smyth', 'Jessica', 'Joshua')
DoubleMetaphone(words)

## $primary
##  [1] "K0RN" "K0RN" "KTRN" "JN0N" "JN0N" "JN"   "TRS"  "0RS"  "SM0"  "SM0"  "JSK"  "JX"
##
## $alternate
##  [1] "KTRN" "KTRN" "KTRN" "ANTN" "ANTN" "AN"   "TRS"  "TRS"  "XMT"  "XMT"  "ASK"  "AX"
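Double Metaphone returns two keys per word so that two words can be treated as a match when any of their keys coincide. Here is a minimal sketch of that comparison, with the primary and alternate codes printed above hard-coded as plain vectors so it runs without PGRdup. Note that dm_match is a hypothetical helper name of my own, not a function from the package.

```r
# Primary and alternate keys copied from the output above
# (hard-coded so the sketch runs without PGRdup).
words <- c("Catherine", "Katherine", "Katarina", "Johnathan",
           "Jonathan", "John", "Teresa", "Theresa", "Smith",
           "Smyth", "Jessica", "Joshua")
primary <- c("K0RN", "K0RN", "KTRN", "JN0N", "JN0N", "JN",
             "TRS", "0RS", "SM0", "SM0", "JSK", "JX")
alternate <- c("KTRN", "KTRN", "KTRN", "ANTN", "ANTN", "AN",
               "TRS", "TRS", "XMT", "XMT", "ASK", "AX")
names(primary) <- words
names(alternate) <- words

# Hypothetical helper: TRUE when two words share any primary or alternate key.
dm_match <- function(a, b) {
  keys_a <- c(primary[[a]], alternate[[a]])
  keys_b <- c(primary[[b]], alternate[[b]])
  length(intersect(keys_a, keys_b)) > 0
}

dm_match("Catherine", "Katarina")  # TRUE: both carry the key KTRN
dm_match("Teresa", "Theresa")      # TRUE: both carry the key TRS
dm_match("Smith", "Jessica")       # FALSE: no key in common
```

Matching on either key is looser than grouping on the primary key alone: it pairs Teresa with Theresa, which the primary codes (TRS vs 0RS) keep apart.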
answered Sep 23 '22 03:09 by Crops