I am doing some text mining (PCA, HC, K-Means) and so far I have managed to code everything right. However, there is a small flaw I'd like to fix.
When I try to stem my corpus, it does not work properly: different words that share the same root are not reduced to the same stem. These are the words I am particularly interested in (they are Spanish and mean "kids" or something related):
niñera, niños, niñas, niña, niño
But when I run the code, only two of these words are stemmed:
niña, niño --> niñ
The others remain unchanged, so I end up stemming niña/niño but not the rest.
This is my code for creating the corpus:
corp <- Corpus(DataframeSource(data.frame(x$service_name)))
docs <- tm_map(corp, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("spanish"))
docs <- tm_map(docs, stemDocument, language = "spanish")
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
dtm
I'd really appreciate some suggestions! Thank you
It seems that the stemming transform can only be applied to PlainTextDocument objects. See ?stemDocument.
sp.corpus = Corpus(VectorSource(c("la niñera. los niños. las niñas. la niña. el niño.")))
docs <- tm_map(sp.corpus, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("spanish"))
docs <- tm_map(docs, PlainTextDocument) # needs to come before stemming
docs <- tm_map(docs, stemDocument, "spanish")
print(docs[[1]]$content)
# " niñer niñ niñ niñ niñ"
versus
# WRONG
sp.corpus = Corpus(VectorSource(c("la niñera. los niños. las niñas. la niña. el niño.")))
docs <- tm_map(sp.corpus, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("spanish"))
docs <- tm_map(docs, stemDocument, "spanish") # WRONG: apply PlainTextDocument first
docs <- tm_map(docs, PlainTextDocument)
print(docs[[1]]$content)
# " niñera niños niñas niña niñ"
In my opinion, this detail is not obvious and it'd be nice to get at least a warning when stemDocument is invoked on a non-PlainTextDocument.
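To tell a pipeline problem apart from a stemmer problem, it can help to call the Snowball stemmer directly on the words in question. This is only a rough check, assuming the SnowballC package is installed and the session uses a UTF-8 locale; the variable names are illustrative.

```r
library(SnowballC)  # provides wordStem(), the stemmer that tm's stemDocument relies on

# The words from the question, already lower-cased
words <- c("niñera", "niños", "niñas", "niña", "niño")

# Stem each word with the Spanish Snowball stemmer, bypassing tm entirely
stems <- wordStem(words, language = "spanish")

# Show each word next to its stem
print(data.frame(word = words, stem = stems))
```

If the stems look right here but not in the corpus, the problem is in the tm_map steps (document type or ordering) rather than in the stemmer.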
I changed from
corpus <- tm_map(corpus, tolower)
to
corpus <- tm_map(corpus, content_transformer(tolower))
and then stemDocument worked.
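For reference, here is a minimal sketch of the whole pipeline with that change applied. It assumes a recent tm version (0.6 or later) plus SnowballC, and uses the sample sentence from the other answer rather than the original x$service_name data. Wrapping tolower in content_transformer keeps each element a text document, so the later steps, including stemDocument, should still see the document type they expect.

```r
library(tm)         # corpus handling and transformations
library(SnowballC)  # stemmer backend used by stemDocument

# Sample text borrowed from the answer above (stands in for the real data)
texts <- c("la niñera. los niños. las niñas. la niña. el niño.")
corpus <- VCorpus(VectorSource(texts))

corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# content_transformer() makes tolower return a document, not a bare character vector
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("spanish"))
corpus <- tm_map(corpus, stemDocument, language = "spanish")
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
print(Terms(dtm))  # inspect which stems ended up as terms
```

With this ordering there should be no need for a separate tm_map(corpus, PlainTextDocument) step.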