I'm working with a large amount of data, mostly names containing non-English characters. My goal is to match these names against information about them collected in the USA.
For example, I might want to match the name 'Sølvsten' (from some list of names) to 'Soelvsten' (the name as stored in some American database). Here is a function I wrote to do this. It's clearly clunky and somewhat arbitrary, so I wonder whether there is a simple R function that translates these foreign characters to their nearest English neighbours. I understand that there may be no standard way to do this conversion, but I'm curious whether one exists and whether it can be done through an R function.
# a function to replace foreign characters
# (gsub() is part of base R, so no extra package is needed)
replaceforeignchars <- function(x)
{
x <- gsub("š","s",x)
x <- gsub("œ","oe",x)
x <- gsub("ž","z",x)
x <- gsub("ß","ss",x)
x <- gsub("þ","y",x)
x <- gsub("à","a",x)
x <- gsub("á","a",x)
x <- gsub("â","a",x)
x <- gsub("ã","a",x)
x <- gsub("ä","a",x)
x <- gsub("å","a",x)
x <- gsub("æ","ae",x)
x <- gsub("ç","c",x)
x <- gsub("è","e",x)
x <- gsub("é","e",x)
x <- gsub("ê","e",x)
x <- gsub("ë","e",x)
x <- gsub("ì","i",x)
x <- gsub("í","i",x)
x <- gsub("î","i",x)
x <- gsub("ï","i",x)
x <- gsub("ð","d",x)
x <- gsub("ñ","n",x)
x <- gsub("ò","o",x)
x <- gsub("ó","o",x)
x <- gsub("ô","o",x)
x <- gsub("õ","o",x)
x <- gsub("ö","o",x)
x <- gsub("ø","oe",x)
x <- gsub("ù","u",x)
x <- gsub("ú","u",x)
x <- gsub("û","u",x)
x <- gsub("ü","u",x)
x <- gsub("ý","y",x)
x <- gsub("ÿ","y",x)
x <- gsub("ğ","g",x)
return(x)
}
Note: I know there are name-matching algorithms such as Jaro-Winkler distance matching, but I'd rather do exact matches.
You can also use stri_trans_general() from the stringi package.
library(stringi)
x <- c("š", "ž", "ğ", "ß", "þ", "à", "á", "â", "ã", "ä", "å", "æ",
"ç", "è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", "ò",
"ó", "ô", "õ", "ö", "ø", "œ", "ù", "ú", "û", "ü", "ý", "ÿ")
y <- stri_trans_general(x, "Latin-ASCII")
data.frame(x, y, stringsAsFactors = FALSE)
#>    x  y
#> 1  š  s
#> 2  ž  z
#> 3  ğ  g
#> 4  ß ss
#> 5  þ th
#> 6  à  a
#> 7  á  a
#> 8  â  a
#> 9  ã  a
#> 10 ä  a
#> 11 å  a
#> 12 æ ae
#> 13 ç  c
#> 14 è  e
#> 15 é  e
#> 16 ê  e
#> 17 ë  e
#> 18 ì  i
#> 19 í  i
#> 20 î  i
#> 21 ï  i
#> 22 ð  d
#> 23 ñ  n
#> 24 ò  o
#> 25 ó  o
#> 26 ô  o
#> 27 õ  o
#> 28 ö  o
#> 29 ø  o
#> 30 œ oe
#> 31 ù  u
#> 32 ú  u
#> 33 û  u
#> 34 ü  u
#> 35 ý  y
#> 36 ÿ  y
Note, however, that this converts “ø” to “o”, not to “oe” as in the question's example:
stri_trans_general("Sølvsten", "Latin-ASCII")
#> [1] "Solvsten"
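If the “oe” spelling from the question is needed, one possible workaround is to apply a few custom replacements first and let the "Latin-ASCII" transliterator handle everything else. The following is only a minimal sketch: the helper name to_ascii() and the overrides argument are made up for illustration, and the mapping shown covers just “ø”/“Ø”, so it would need extending for other characters you want spelled differently.

```r
library(stringi)

# Hypothetical helper (not part of stringi): apply custom replacements
# first, then transliterate whatever is left to ASCII.
to_ascii <- function(x, overrides = c("ø" = "oe", "Ø" = "Oe")) {
  # vectorize_all = FALSE applies every pattern, in turn, to every string
  x <- stri_replace_all_fixed(x, names(overrides), unname(overrides),
                              vectorize_all = FALSE)
  stri_trans_general(x, "Latin-ASCII")
}

to_ascii(c("Sølvsten", "Müller"))
#> [1] "Soelvsten" "Muller"
```

Running the same helper over both name lists before comparing them keeps the matching exact, as requested in the question.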