I am doing a lot of analysis with the TM
package. One of my biggest problems are related to stemming and stemming-like transformations.
Let's say I have several accounting related terms (I am aware of the spelling issues).
After stemming we have:
accounts -> account
account -> account
accounting -> account
acounting -> acount
acount -> acount
acounts -> acount
accounnt -> accounnt
Result: 3 Terms (account, acount, account) where I would have liked 1 (account) as all these relate to the same term.
1) To correct spelling is a possibility, but I have never attempted that in R. Is that even possible?
2) The other option is to make a reference list i.e. account = (accounts, account, accounting, acounting, acount, acounts, accounnt) and then replace all occurrences with the master term. How would I do this in R?
Once again, any help/suggestions would be greatly appreciated.
The tm package in R provides the stemDocument() function to stem the document to it's root. This function either takes in a character vector and returns a character vector, or takes in a PlainTextDocument and returns a PlainTextDocument. example: stemDocument(running,runs,ran) would return (run,run,ran) as the ouput.
R has a rich set of packages for Natural Language Processing (NLP) and generating plots. The foundational steps involve loading the text file into an R Corpus, then cleaning and stemming the data before performing analysis.
Different data-mining activities, such as data processing, supervised and unsupervised learning, association mining, and so on, can be performed using the RWeka package. For natural language processing, RWeka provides tokenization and stemming functions.
This question inspired me to attempt to write a spell check for the qdap
package. There's an interactive version that may be useful here. It's available in qdap >= version 2.1.1
. That means you'll need the dev version at the moment.. here are the steps to install:
library(devtools)
install_github("qdapDictionaries", "trinker")
install_github("qdap", "trinker")
library(tm); library(qdap)
## Recreate a Corpus
like you describe.
terms <- c("accounts", "account", "accounting", "acounting", "acount", "acounts", "accounnt")
fake_text <- unlist(lapply(terms, function(x) {
paste(sample(c(x, sample(DICTIONARY[[1]], sample(1:5, 1)))), collapse=" ")
}))
fake_text
inspect(myCorp <- Corpus(VectorSource(fake_text)))
## The interactive spell checker (check_spelling_interactive
)
m <- check_spelling_interactive(as.data.frame(myCorp)[[2]])
preprocessed(m)
inspect(myCorp <- tm_map(myCorp, correct(m)))
The correct
function merely grabs a closure function from the output of check_spelling_interactive
and allows you to then apply the "correcting" to any new text string(s).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With