Stemming with R Text Analysis

Tags: text, r, stemming, tm

I am doing a lot of analysis with the tm package. One of my biggest problems is related to stemming and stemming-like transformations.

Let's say I have several accounting related terms (I am aware of the spelling issues).
After stemming we have:

accounts   -> account  
account    -> account  
accounting -> account  
acounting  -> acount  
acount     -> acount  
acounts    -> acount  
accounnt   -> accounnt

Result: 3 terms (account, acount, accounnt) where I would have liked 1 (account), as all of these relate to the same term.

1) Correcting the spelling is one possibility, but I have never attempted that in R. Is that even possible?

2) The other option is to make a reference list, i.e. account = (accounts, account, accounting, acounting, acount, acounts, accounnt), and then replace all occurrences with the master term. How would I do this in R?

Once again, any help/suggestions would be greatly appreciated.

asked Jun 27 '14 03:06

RUser

1 Answer

This question inspired me to attempt to write a spell checker for the qdap package. There's an interactive version that may be useful here. It's available in qdap version 2.1.1 or later, which means you'll need the dev version at the moment. Here are the steps to install it:

library(devtools)
# Current devtools expects "user/repo"; the older two-argument form no longer works
install_github("trinker/qdapDictionaries")
install_github("trinker/qdap")
library(tm); library(qdap)

## Recreate a Corpus like you describe.

terms <- c("accounts", "account", "accounting", "acounting", "acount", "acounts", "accounnt")

fake_text <- unlist(lapply(terms, function(x) {
    paste(sample(c(x, sample(DICTIONARY[[1]], sample(1:5, 1)))), collapse=" ")
}))

fake_text

inspect(myCorp <- Corpus(VectorSource(fake_text)))

## The interactive spell checker (check_spelling_interactive)

m <- check_spelling_interactive(as.data.frame(myCorp)[[2]])
preprocessed(m)
inspect(myCorp <- tm_map(myCorp, correct(m)))

The correct function merely grabs a closure function from the output of check_spelling_interactive and allows you to then apply the "correcting" to any new text string(s).
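
For the reference-list option raised in the question, the same closure idea can be sketched in base R. This is only a minimal illustration, not how qdap implements `correct`: `make_corrector` and `fix_account` are hypothetical names invented for this example, and the variant list is the one given in the question.

```r
# Hypothetical helper (not part of qdap or tm): returns a closure that
# replaces any listed variant, matched as a whole word, with the master term.
make_corrector <- function(master, variants) {
  # \\b anchors keep the match to whole words, so "discount" is left alone
  pattern <- paste0("\\b(", paste(variants, collapse = "|"), ")\\b")
  function(x) gsub(pattern, master, x, perl = TRUE)
}

fix_account <- make_corrector(
  master   = "account",
  variants = c("accounts", "account", "accounting", "acounting",
               "acount", "acounts", "accounnt")
)

fix_account(c("the acounts were wrong", "accounting for accounnt errors"))
# [1] "the account were wrong"     "account for account errors"
```

Assuming a tm version that provides `content_transformer`, a closure like this could probably be applied to a corpus with `tm_map(myCorp, content_transformer(fix_account))`. The obvious limitation is that every variant has to be listed by hand, which is where the interactive spell checker above is more convenient.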

answered Oct 25 '22 09:10

Tyler Rinker
