
Stemming with R Text Analysis

Tags: text, r, stemming, tm

I am doing a lot of analysis with the tm package. One of my biggest problems is related to stemming and stemming-like transformations.

Let's say I have several accounting related terms (I am aware of the spelling issues).
After stemming we have:

accounts   -> account  
account    -> account  
accounting -> account  
acounting  -> acount  
acount     -> acount  
acounts    -> acount  
accounnt   -> accounnt  

Result: 3 terms (account, acount, accounnt) where I would have liked 1 (account), as all of these relate to the same term.

1) To correct spelling is a possibility, but I have never attempted that in R. Is that even possible?

2) The other option is to make a reference list i.e. account = (accounts, account, accounting, acounting, acount, acounts, accounnt) and then replace all occurrences with the master term. How would I do this in R?

Once again, any help/suggestions would be greatly appreciated.

asked Jun 27 '14 03:06 by RUser




1 Answer

This question inspired me to attempt to write a spell checker for the qdap package. There's an interactive version that may be useful here. It's available in qdap >= 2.1.1, which means you'll need the dev version at the moment. Here are the steps to install:

library(devtools)
## Current devtools expects "user/repo"; the older (repo, username) form is no longer supported.
install_github("trinker/qdapDictionaries")
install_github("trinker/qdap")
library(tm); library(qdap)

## Recreate a Corpus like you describe.

terms <- c("accounts", "account", "accounting", "acounting", "acount", "acounts", "accounnt")

fake_text <- unlist(lapply(terms, function(x) {
    paste(sample(c(x, sample(DICTIONARY[[1]], sample(1:5, 1)))), collapse=" ")
}))

fake_text

inspect(myCorp <- Corpus(VectorSource(fake_text)))

## The interactive spell checker (check_spelling_interactive)

m <- check_spelling_interactive(as.data.frame(myCorp)[[2]])
preprocessed(m)
inspect(myCorp <- tm_map(myCorp, correct(m)))

The correct function merely grabs a closure function from the output of check_spelling_interactive, which lets you apply the same "correcting" to any new text string(s).
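For the reference-list route (option 2 in the question), the same closure idea can be sketched in base R without the interactive step. This is only an illustration under stated assumptions: make_corrector and fix_terms are hypothetical names, not part of qdap or tm, the text is assumed to be lower case already, and the variants are assumed to be plain words with no regex metacharacters.

```r
## Hypothetical helper (not part of qdap or tm): build a "corrector" closure
## from a reference list of the form list(master = c(variant1, variant2, ...)).
make_corrector <- function(reference) {
  ## Flatten the list into a named lookup vector: variant -> master term
  lookup <- setNames(rep(names(reference), lengths(reference)),
                     unlist(reference, use.names = FALSE))
  ## One whole-word pattern covering every variant
  ## (assumes variants contain no regex metacharacters)
  pattern <- paste0("\\b(", paste(names(lookup), collapse = "|"), ")\\b")

  ## The returned function remembers `lookup` and `pattern`
  function(x) {
    m <- gregexpr(pattern, x, perl = TRUE)
    ## Swap each matched variant for its master term
    regmatches(x, m) <- lapply(regmatches(x, m),
                               function(hit) unname(lookup[hit]))
    x
  }
}

reference <- list(account = c("accounts", "account", "accounting", "acounting",
                              "acount", "acounts", "accounnt"))
fix_terms <- make_corrector(reference)

fix_terms(c("the acounts were audited", "accounting for accounnt errors"))
```

On a corpus this could be applied much like correct(m) above. With tm 0.6 or later, a plain function of this kind would likely need wrapping, e.g. tm_map(myCorp, content_transformer(fix_terms)). Running it before stemDocument() should leave a single stem for all of the listed variants.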

answered Oct 25 '22 09:10 by Tyler Rinker