Automatically extracting strings with mismatched spellings from a column and replacing them in R [closed]

Question

I have a huge dataset similar to the columns posted below:

NameofEmployee <- c("x", "y", "z", "a")
Region <- c("Pune", "Orissa", "Orisa", "Poone")

As you can see, in the Region column, the region "Pune" is spelled in two different ways, i.e. "Pune" and "Poone".

Similarly, "Orissa" is spelled as "Orissa" and "Orisa".

I have multiple regions which are actually the same but are spelled in different ways. This will cause problems when I analyze the data.

I want to automatically be able to obtain a list of these mismatched spellings with the help of R.
I would also like to replace the spellings with the correct spellings automatically.

Rui Barradas · Accepted Answer

I believe that you should use a phonetic code to determine which spellings are close to which.

A good choice is the soundex algorithm, implemented in several R packages. I will use package stringdist.

library(stringdist)

Region <- c("Pune", "Orissa", "Orisa", "Poone")
phonetic(Region)
#[1] "P500" "O620" "O620" "P500"

As you can see, Region[1] and Region[4] have the same soundex code. The same holds for Region[2] and Region[3].
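To go from codes to an actual replacement, one option is to group the spellings by soundex code and keep one spelling per group. The sketch below is only an illustration under stated assumptions: the sample vector is extended with repeated entries (not from the question), the most frequent spelling in each group is assumed to be the correct one, and the names variants, mismatched, canonical and Region_clean are made up for this example.

```r
library(stringdist)

# Hypothetical data: the question's values plus repeats, so that the
# correct spelling is the most frequent one in each group (assumption).
Region <- c("Pune", "Orissa", "Orisa", "Poone", "Pune", "Orissa")

# Soundex code for every entry
codes <- phonetic(Region)

# Distinct spellings that share a code
variants <- lapply(split(Region, codes), unique)

# Keep only the codes that have more than one spelling
mismatched <- variants[lengths(variants) > 1]

# For each code, pick the most frequent spelling as the canonical one
canonical <- vapply(split(Region, codes),
                    function(x) names(which.max(table(x))),
                    character(1))

# Replace every entry by the canonical spelling of its code
Region_clean <- unname(canonical[codes])
Region_clean
```

If two spellings are equally frequent, which.max() simply takes the first one alphabetically, so in real data it is probably safer to review the mismatched list by hand before replacing anything.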

Colin FAY · Answer

Misspellings are hard to detect, even more so when working with names.

I suggest using a string distance to measure how close two words are. You can do this easily with tidystringdist, which generates all pairwise combinations from a vector and then computes every string distance method available in stringdist:

Region <- c("Pune", "Orissa", "Orisa", "Poone")

library(tidystringdist)
library(magrittr)

tidy_comb_all(Region) %>%
  tidy_stringdist()
#> # A tibble: 6 x 12
#>   V1     V2      osa    lv    dl hamming   lcs qgram cosine jaccard     jw
#> * <chr>  <chr> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl>  <dbl>
#> 1 Pune   Oris…     6     6     6     Inf    10    10 1          1   1     
#> 2 Pune   Orisa     5     5     5     Inf     9     9 1          1   1     
#> 3 Pune   Poone     2     2     2     Inf     3     3 0.433      0.4 0.217 
#> 4 Orissa Orisa     1     1     1     Inf     1     1 0.0513     0   0.0556
#> 5 Orissa Poone     6     6     6     Inf    11    11 1          1   1     
#> 6 Orisa  Poone     5     5     5       5    10    10 1          1   1     
#> # ... with 1 more variable: soundex <dbl>

Created on 2018-07-24 by the reprex package (v0.2.0).

As you can see here, Pune and Poone have an osa, lv and dl distance of 2, and Orisa / Orissa a distance of 1, suggesting their spelling is very close.
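To get the list of suspicious pairs rather than read it off the table, the distances can be filtered with a threshold. The following is a rough sketch using stringdist directly instead of tidystringdist; the cut-off of 2 and the names d, close and pairs are assumptions for this example, and a suitable cut-off will depend on your data.

```r
library(stringdist)

Region <- c("Pune", "Orissa", "Orisa", "Poone")

# Full matrix of pairwise osa distances
d <- stringdistmatrix(Region, Region, method = "osa")

# Positions in the upper triangle (each pair once) with distance <= 2
close <- which(d <= 2 & upper.tri(d), arr.ind = TRUE)

# Candidate pairs of mismatched spellings with their distance
pairs <- data.frame(V1  = Region[close[, "row"]],
                    V2  = Region[close[, "col"]],
                    osa = d[close])
pairs
```

On the four values from the question this keeps only the Orissa / Orisa and Pune / Poone pairs.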

When you have identified these, you can do the replacement.
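One possible way to do the replacement, assuming you can write down the list of correct spellings, is approximate matching with amatch() from stringdist. In this sketch the vector correct, the maximum distance of 2 and the name Region_fixed are assumptions made for illustration, not something given in the question.

```r
library(stringdist)

Region <- c("Pune", "Orissa", "Orisa", "Poone")

# Hypothetical reference list of correct spellings
correct <- c("Pune", "Orissa")

# Index of the closest correct spelling within an osa distance of 2,
# or NA when nothing is close enough
idx <- amatch(Region, correct, method = "osa", maxDist = 2)

# Use the matched spelling where there is one, keep the original otherwise
Region_fixed <- ifelse(is.na(idx), Region, correct[idx])
Region_fixed
```

Entries that are further than maxDist from every reference spelling are left unchanged, which makes it easy to spot values that still need manual attention.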

Tags: string, r, text-analysis

Skurup
