
Search for misspellings of a word in a character vector with R - "inverse" spell checker

I am text mining a large database to create indicator variables which indicate the occurrence of certain phrases in a comments field of an observation. The comments were entered by technicians, so the terms used are always consistent.

However, there are some cases where the technicians misspelled a word, so my grepl() call doesn't catch that the phrase (albeit misspelled) occurred in an observation. Ideally, I would like to submit each word in a phrase to a function that returns several common misspellings or typos of that word. Does such an R function exist?

With this, I could search for all possible combinations of these misspellings of the phrase in the comments field, and output the matches to another data frame. This way, I could look at each occurrence on a case-by-case basis to determine whether the phenomenon I am interested in was actually described by the technician.

I have Googled around, but have only found references to actual spell checking packages for R. What I am looking for is an "inverse" spell checker. Since the number of phrases I am looking for is relatively small, I would realistically be able to check for misspellings by hand; I just figured it would be nice to have this ability built into an R package for future text mining efforts.

Thank you for your time!

Nick Evans asked Feb 01 '13 21:02



1 Answer

As Gavin Simpson suggested, you can use aspell(). I guess you need the aspell program installed for this to work. Many Linux distributions include it by default; I don't know about other systems, or whether it is installed along with R.

The following function shows an example of its use. How you adapt it depends on your input data and on what exactly you want to do with the result (e.g. correct each misspelling with the first suggestion), which you didn't specify:

check_spelling <- function(text) {
  # Strip commas and periods, then split the text into words on spaces
  text <- gsub("[,.]", "", text)
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  # Write one word per line to a temporary file, removed when the function exits
  filename <- tempfile()
  on.exit(unlink(filename))
  writeLines(words, con = filename)
  # Check the spelling of the file using aspell()
  result <- aspell(filename)
  # Extract the list of suggestions, named by the misspelled word
  suggestions <- result$Suggestions
  names(suggestions) <- result$Original
  suggestions
}

> text <- "I am text mining a large database to create indicator variables which indicate the occurence of certain phrases in a comments field of an observation. The comments were entered by technicians, so the terms used are always consistent. "
> check_spelling(text)
$occurence
[1] "occurrence"   "occurrences"  "occurrence's"
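
Note that aspell() goes from a misspelling to its corrections. For the direction asked about in the question (flagging comments that contain something close to a known term, for review by hand), approximate matching with base R's agrepl() might be an alternative to generating misspellings up front. This is only a rough sketch of that different technique: the `comments` data frame, the helper name `find_near_misses` and the threshold of 2 edits are all made up for illustration, and the threshold would need tuning on real data.

```r
# Hypothetical sample data standing in for the technicians' comments
comments <- data.frame(
  id = 1:4,
  comment = c(
    "Noted occurrence of corrosion on valve",
    "Second occurence of leak near pump",
    "Routine inspection, nothing found",
    "Ocurrence of vibration reported"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the rows whose text approximately matches
# `term` (within `max_edits` edits) but does not contain it exactly,
# i.e. candidate misspellings to check by hand.
find_near_misses <- function(data, column, term, max_edits = 2) {
  text <- data[[column]]
  # Approximate (fuzzy) substring match, ignoring case
  approx <- agrepl(term, text, max.distance = max_edits, ignore.case = TRUE)
  # Exact, case-insensitive substring match
  exact <- grepl(tolower(term), tolower(text), fixed = TRUE)
  data[approx & !exact, , drop = FALSE]
}

near <- find_near_misses(comments, "comment", "occurrence")
print(near)
```

With this sample data the result holds the two comments containing "occurence" and "Ocurrence", while the correctly spelled comment and the unrelated one are left out. A larger `max_edits` catches more typos but also more false positives, so the case-by-case review is still needed.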
Jan van der Laan answered Oct 19 '22 07:10
