I am text mining a large database to create indicator variables which indicate the occurrence of certain phrases in a comments field of an observation. The comments were entered by technicians, so the terms used are always consistent.
However, there are some cases where the technicians misspelled a word, so my grepl() call doesn't catch that the phrase (albeit misspelled) occurred in an observation. Ideally, I would like to submit each word in a phrase to a function that returns several common misspellings or typos of that word. Does such an R function exist?
With this, I could search the comments field for all possible combinations of these misspellings of the phrase and output the matches to another data frame. This way, I could look at each occurrence on a case-by-case basis to determine if the phenomenon I am interested in was actually described by the technician.
I have Googled around, but have only found references to actual spell checking packages for R. What I am looking for is an "inverse" spell checker. Since the number of phrases I am looking for is relatively small, I would realistically be able to check for misspellings by hand; I just figured it would be nice to have this ability built into an R package for future text mining efforts.
Thank you for your time!
As Gavin Simpson suggested, you can use aspell(). I believe this only works if the aspell program is installed. Many Linux distributions include it by default; I don't know about other systems, or whether it is installed with R.
The following function shows an example of its use. How you apply it depends on your input data and on what exactly you want to do with the result (e.g. replace each misspelling with the first suggestion), which you didn't specify:
check_spelling <- function(text) {
  # Create a file with one of the words we want to check on each line
  text <- gsub("[,.]", "", text)
  text <- strsplit(text, " ", fixed = TRUE)[[1]]
  filename <- tempfile()
  writeLines(text, con = filename)
  # Check spelling of file using aspell
  result <- aspell(filename)
  # Extract list of suggestions from result
  suggestions <- result$Suggestions
  names(suggestions) <- result$Original
  unlink(filename)
  suggestions
}
> text <- "I am text mining a large database to create indicator variables which indicate the occurence of certain phrases in a comments field of an observation. The comments were entered by technicians, so the terms used are always consistent. "
> check_spelling(text)
$occurence
[1] "occurrence" "occurrences" "occurrence's"
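If the goal is mainly to flag comments that contain a phrase despite typos, approximate matching may get you there without generating misspellings first. Below is a minimal sketch using base R's agrepl(), which is a different technique from the aspell approach above. The comments, the phrase, and the threshold of 2 edits are all made up for illustration and would need tuning on real data:

```r
# Hypothetical example data: free-text comments, one of them with typos.
comments <- c(
  "Observed occurrence of corrosion on valve",
  "Observed occurence of corosion on valve",
  "Routine inspection, nothing to report"
)

# Hypothetical phrase of interest.
phrase <- "occurrence of corrosion"

# Exact matching misses the misspelled comment.
exact <- grepl(phrase, comments, fixed = TRUE)

# Approximate matching: allow up to 2 edits (insertions, deletions,
# substitutions) between the phrase and some part of each comment.
fuzzy <- agrepl(phrase, comments, max.distance = 2, ignore.case = TRUE)

# Comments caught only by the fuzzy match, collected for manual review.
review <- data.frame(comment = comments[fuzzy & !exact],
                     stringsAsFactors = FALSE)
print(review)
```

A larger max.distance catches more typos but also more false positives, so reviewing the flagged rows case by case, as you planned, still seems sensible.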