
Stemming words using tm package in R does not work properly?

I am doing some text mining (PCA, HC, K-Means) and so far I have managed to code everything right. However, there is a small flaw I'd like to fix.

When I try to stem my corpus, it does not work properly: different words that share the same root are not reduced to a common stem. These are the words I am particularly interested in (they are Spanish and mean "kids" or related terms):

niñera, niños, niñas, niña, niño

But when I run the code, the words stay as they are, except for

niña, niño --> niñ

The others remain unchanged, so I end up stemming only niña/niño and not the rest.

This is my code for creating the corpus:

corp <- Corpus(DataframeSource(data.frame(x$service_name)))
docs <- tm_map(corp, removePunctuation)
docs <- tm_map(docs, removeNumbers) 
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("spanish"))
docs <- tm_map(docs, stemDocument, language = "spanish") 
docs <- tm_map(docs, PlainTextDocument) 
dtm <- DocumentTermMatrix(docs)   
dtm  

I'd really appreciate some suggestions! Thank you

adrian1121 asked May 01 '16 14:05



2 Answers

It seems that the stemming transformation can only be applied to PlainTextDocument objects; see ?stemDocument. The conversion to PlainTextDocument therefore has to come before stemming:

sp.corpus = Corpus(VectorSource(c("la niñera. los niños. las niñas. la niña. el niño.")))
docs <- tm_map(sp.corpus, removePunctuation)
docs <- tm_map(docs, removeNumbers) 
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("spanish"))
docs <- tm_map(docs, PlainTextDocument)  # needs to come before stemming
docs <- tm_map(docs, stemDocument, "spanish")
print(docs[[1]]$content)

# " niñer  niñ  niñ  niñ  niñ"

versus

# WRONG
sp.corpus = Corpus(VectorSource(c("la niñera. los niños. las niñas. la niña. el niño.")))
docs <- tm_map(sp.corpus, removePunctuation)
docs <- tm_map(docs, removeNumbers) 
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("spanish"))
docs <- tm_map(docs, stemDocument, "spanish")  # WRONG: apply PlainTextDocument first
docs <- tm_map(docs, PlainTextDocument)  
print(docs[[1]]$content)

# " niñera  niños  niñas  niña  niñ"

In my opinion, this detail is not obvious and it'd be nice to get at least a warning when stemDocument is invoked on a non-PlainTextDocument.
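To check that the Snowball stemmer itself handles these words, and that the order of the tm transformations is therefore the likely culprit, one option is to call the stemmer directly. This is only a minimal sketch: it assumes the SnowballC package (which tm's stemDocument relies on) is installed and that the session runs in a UTF-8 locale; the variable names are illustrative.

```r
library(SnowballC)

# The five words from the question
palabras <- c("niñera", "niños", "niñas", "niña", "niño")

# Stem each word with the Spanish Snowball stemmer, bypassing tm entirely
stems <- wordStem(palabras, language = "spanish")

# Show each word next to its stem
print(data.frame(word = palabras, stem = stems))
```

If the stems printed here match the output of the working pipeline above (niñer for niñera, niñ for the rest), the stemmer is fine and the problem lies in how the documents reach stemDocument.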

Ryan Walker answered Oct 17 '22 21:10



I changed from

corpus <- tm_map(corpus, tolower) 

to

corpus <- tm_map(corpus, content_transformer(tolower))

and then stemDocument worked.
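
Putting both answers together, a pipeline along the following lines should keep every document a PlainTextDocument up to the stemming step. This is a sketch rather than a drop-in fix: it assumes tm and SnowballC are installed, builds a VCorpus from a made-up character vector in place of the asker's x$service_name, and wraps tolower in content_transformer as described above.

```r
library(tm)

# Illustrative input; in the question this would be x$service_name
textos <- c("La niñera.", "Los niños.", "Las niñas.", "La niña.", "El niño.")

corp <- VCorpus(VectorSource(textos))
docs <- tm_map(corp, removePunctuation)
docs <- tm_map(docs, removeNumbers)
# In recent tm versions, content_transformer() keeps each element a text
# document instead of replacing it with a bare character vector
docs <- tm_map(docs, content_transformer(tolower))
# Lowercasing first lets "La", "Los", ... match the lowercase stopword list
docs <- tm_map(docs, removeWords, stopwords("spanish"))
docs <- tm_map(docs, stemDocument, language = "spanish")

dtm <- DocumentTermMatrix(docs)
print(Terms(dtm))
```

Going by the output reported in the first answer, the terms should collapse to niñ and niñer. No separate PlainTextDocument step is needed in this variant, because none of the transformations turns the documents into plain character vectors.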

ResearchBigD answered Oct 17 '22 22:10
