Search for misspellings of a word in a character vector with R - "inverse" spell checker

Question

I am text mining a large database to create indicator variables which indicate the occurrence of certain phrases in a comments field of an observation. The comments were entered by technicians, so the terms used are always consistent.

However, in some cases the technicians misspelled a word, so my grepl() call doesn't catch that the phrase (albeit misspelled) occurred in an observation. Ideally, I would like to submit each word in a phrase to a function that returns several common misspellings or typos of that word. Does such an R function exist?

With this, I could search for all possible combinations of these misspellings of the phrase in the comments field, and output the matches to another data frame. This way, I could look at each occurrence on a case-by-case basis to determine if the phenomenon I am interested in was actually described by the technician.
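
To make the idea concrete, here is a minimal sketch of the search step, assuming the misspelling variants for each word are already available. The phrase, the variant lists, and the comments are all made up for illustration; only base R functions are used.

```r
# Hypothetical misspelling variants for each word of the phrase "valve leak"
variants <- list(
  c("valve", "vavle", "valv"),
  c("leak", "leek", "laek")
)

# Every combination of variants, one candidate phrase per row
combos <- expand.grid(variants, stringsAsFactors = FALSE)
phrases <- apply(combos, 1, paste, collapse = " ")

# Made-up comments field
comments <- c("Found vavle leak near pump", "No issues", "valve leek repaired")

# One alternation pattern over all candidate phrases; the variants here are
# plain words, so no regex escaping is needed
pattern <- paste0("\\b(", paste(phrases, collapse = "|"), ")\\b")
hits <- grepl(pattern, comments, ignore.case = TRUE, perl = TRUE)

# Collect the matching observations for case-by-case review
review <- data.frame(row = which(hits), comment = comments[hits],
                     stringsAsFactors = FALSE)
```

The review data frame keeps the original row index, so each flagged comment can be traced back to its observation.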

I have Googled around, but have only found references to actual spell checking packages for R. What I am looking for is an "inverse" spell checker. Since the number of phrases I am looking for is relatively small, I would realistically be able to check for misspellings by hand; I just figured it would be nice to have this ability built into an R package for future text mining efforts.

Thank you for your time!

Jan van der Laan · Accepted Answer

As Gavin Simpson suggested, you can use aspell. I guess for this to work you need aspell installed. On many Linux distributions it is installed by default; I don't know about other systems or whether it is installed with R.

See the following function for an example of use. How you adapt it depends on your input data and on what exactly you want to do with the result (e.g. correct each misspelling with the first suggestion), which you didn't specify:

check_spelling <- function(text) {
  # Strip commas and periods, then split the text into words
  text <- gsub("[,.]", "", text)
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  # Write one word per line to a temporary file, removed when the function exits
  filename <- tempfile()
  on.exit(unlink(filename))
  writeLines(words, con = filename)
  # Check the spelling of the file with aspell
  result <- aspell(filename)
  # Return the suggestions as a list named by the misspelled words
  suggestions <- result$Suggestions
  names(suggestions) <- result$Original
  suggestions
}

> text <- "I am text mining a large database to create indicator variables which indicate the occurence of certain phrases in a comments field of an observation. The comments were entered by technicians, so the terms used are always consistent. "
> check_spelling(text)
$occurence
[1] "occurrence"   "occurrences"  "occurrence's"
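
To turn this into the "inverse" lookup from the question, one option is to keep only the flagged words whose suggestions include the word you care about. The sketch below is an assumption about how that could look, not part of the function above: misspellings_of is a hypothetical helper, and the input list is a hand-built stand-in shaped like the return value of check_spelling (the "technicans" entry is invented), so it runs without aspell.

```r
# Keep the flagged words whose suggestions include the target word.
# `suggestions` is a named list shaped like the return value of check_spelling().
misspellings_of <- function(target, suggestions) {
  hit <- vapply(suggestions, function(s) target %in% s, logical(1))
  unique(names(suggestions)[hit])
}

# Hand-built stand-in for check_spelling() output (second entry is made up)
suggestions <- list(
  occurence = c("occurrence", "occurrences", "occurrence's"),
  technicans = c("technicians", "technician's")
)

misspellings_of("occurrence", suggestions)
# [1] "occurence"
```

The resulting words could then be used as fixed patterns in grepl() against the comments field.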

Tags: r, spell-checking, text-mining, tm

Nick Evans
