
Automatically extracting strings with mismatched spellings from a column and replacing them in R [closed]

I have a huge dataset similar to the columns posted below:

NameofEmployee <- c("x", "y", "z", "a")
Region <- c("Pune", "Orissa", "Orisa", "Poone")

As you can see, in the Region column, the region "Pune" is spelled in two different ways, i.e. "Pune" and "Poone".

Similarly, "Orissa" is spelled as "Orissa" and "Orisa".

I have multiple regions which are actually the same but are spelled in different ways. This will cause problems when I analyze the data.

I want to automatically be able to obtain a list of these mismatched spellings with the help of R.
I would also like to replace the spellings with the correct spellings automatically.

Skurup asked Jul 24 '18 06:07



2 Answers

I believe that you should use a phonetic code to determine which spellings are close to which.

A good choice is the soundex algorithm, implemented in several R packages. I will use package stringdist.

library(stringdist)

Region <- c("Pune", "Orissa", "Orisa", "Poone")
phonetic(Region)
#[1] "P500" "O620" "O620" "P500"

As you can see, Region[1] and Region[4] have the same soundex code. And the same for Region[2] and Region[3].
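To turn the codes into a list of mismatched spellings and a cleaned column, one option is to group the values by their soundex code. This is only a minimal sketch: the rule for picking the "correct" spelling (the most frequent one in each group, or the first to appear on a tie) is my assumption, not something soundex gives you, and the names `mismatched`, `canonical` and `Region_clean` are just illustrative. Note also that soundex keeps the first letter, so a typo in the first letter will not be caught this way.

```r
library(stringdist)

Region <- c("Pune", "Orissa", "Orisa", "Poone")

# Soundex code for every value in the column
codes <- phonetic(Region)

# Distinct spellings per code; groups with more than one spelling are the mismatches
variants   <- lapply(split(Region, codes), unique)
mismatched <- variants[lengths(variants) > 1]

# Assumption: the canonical spelling is the most frequent one in its group
# (ties are resolved in favour of the spelling that appears first)
canonical <- vapply(
  split(Region, codes),
  function(v) names(which.max(table(factor(v, levels = unique(v))))),
  character(1)
)

# Map every value to the canonical spelling of its group
Region_clean <- unname(canonical[codes])
```

With real data you would want to look over `mismatched` before trusting the replacement, since different places can share a soundex code.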

Rui Barradas answered Oct 23 '22 10:10



Misspellings are hard to detect, even more so when working with names.

I'd suggest using a string distance to measure how close two words are. You can easily do this with tidystringdist, which lets you get all the pairwise combinations from a vector and then compute all the string distance methods available in stringdist:

Region <- c("Pune", "Orissa", "Orisa", "Poone")

library(tidystringdist)
library(magrittr)

tidy_comb_all(Region) %>%
  tidy_stringdist()
#> # A tibble: 6 x 12
#>   V1     V2      osa    lv    dl hamming   lcs qgram cosine jaccard     jw
#> * <chr>  <chr> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl>  <dbl>
#> 1 Pune   Oris…     6     6     6     Inf    10    10 1          1   1     
#> 2 Pune   Orisa     5     5     5     Inf     9     9 1          1   1     
#> 3 Pune   Poone     2     2     2     Inf     3     3 0.433      0.4 0.217 
#> 4 Orissa Orisa     1     1     1     Inf     1     1 0.0513     0   0.0556
#> 5 Orissa Poone     6     6     6     Inf    11    11 1          1   1     
#> 6 Orisa  Poone     5     5     5       5    10    10 1          1   1     
#> # ... with 1 more variable: soundex <dbl>

Created on 2018-07-24 by the reprex package (v0.2.0).

As you can see here, Pune and Poone have an osa, lv and dl distance of 2, and Orisa / Orissa a distance of 1, suggesting their spelling is very close.
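If you only want the suspicious pairs rather than the full table, you can filter on a distance threshold. Here is a rough sketch using stringdist directly. The cutoff of 2 on the osa distance is an assumption chosen to fit this small example and would need tuning on real data, and `close_pairs` is just an illustrative name.

```r
library(stringdist)

Region <- c("Pune", "Orissa", "Orisa", "Poone")

# Work on distinct spellings only
vals <- unique(Region)

# Pairwise osa distances between all distinct spellings
d <- stringdistmatrix(vals, vals, method = "osa")

# Keep each pair once (upper triangle) when the distance is at most 2
# (assumption: 2 is a sensible cutoff for this data)
idx <- which(d <= 2 & upper.tri(d), arr.ind = TRUE)

close_pairs <- data.frame(
  V1  = vals[idx[, "row"]],
  V2  = vals[idx[, "col"]],
  osa = d[idx]
)
close_pairs
```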

When you have identified these, you can do the replacement.
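One possible way to do that replacement, assuming you can write down the vector of correct spellings yourself (here the hypothetical `correct`), is approximate matching with `amatch()` from stringdist. The `maxDist = 2` cutoff is again an assumption; values with no match within that distance are left unchanged.

```r
library(stringdist)

Region <- c("Pune", "Orissa", "Orisa", "Poone")

# Assumption: a hand-made reference list of the correct spellings
correct <- c("Pune", "Orissa")

# Index of the closest correct spelling within an osa distance of 2, NA if none
idx <- amatch(Region, correct, method = "osa", maxDist = 2)

# Replace matched values, keep the original where nothing was close enough
Region_fixed <- ifelse(is.na(idx), Region, correct[idx])
Region_fixed
```

Keeping unmatched values as they are makes it easy to spot spellings that need a manual decision.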

Colin FAY answered Oct 23 '22 11:10
