I want to stem the documents in a Corpus of plain text documents using the tm package in R. When I apply the SnowballStemmer function to all documents of the corpus, only the last word of each document is stemmed.
library(tm)
library(Snowball)
library(RWeka)
library(rJava)
path <- c("C:/path/to/diretory")
corp <- Corpus(DirSource(path),
readerControl = list(reader = readPlain, language = "en_US",
load = TRUE))
tm_map(corp,SnowballStemmer) #stemDocument has the same problem
I think it is related to the way the documents are read into the corpus. To illustrate this with some simple examples:
> vec<-c("running runner runs","happyness happies")
> stemDocument(vec)
[1] "running runner run" "happyness happi"
> vec2<-c("running","runner","runs","happyness","happies")
> stemDocument(vec2)
[1] "run" "runner" "run" "happy" "happi" <-
> corp<-Corpus(VectorSource(vec))
> corp<-tm_map(corp, stemDocument)
> inspect(corp)
A corpus with 2 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
run runner run
[[2]]
happy happi
> corp2<-Corpus(DirSource(path),readerControl=list(reader=readPlain,language="en_US" , load=T))
> corp2<-tm_map(corp2, stemDocument)
> inspect(corp2)
A corpus with 2 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
$`1.txt`
running runner runs
$`2.txt`
happyness happies
In simple words stemming is reducing a word to its base word or stem in such a way that the words of similar kind lie under a common stem. For example – The words care, cared and caring lie under the same stem 'care'.
Following are the steps: 1) Convert the plural form of a word to its singular form. 2) Convert the past tense of a word to its present tense and remove the suffix 'ing'.
Snowball is a small string processing programming language designed for creating stemming algorithms for use in information retrieval. The Snowball compiler translates a Snowball script (a . sbl file) into program in thread-safe ANSI C, Java, Ada, C#, Go, Javascript, Object Pascal, Python or Rust.
Snowball stemmer: This algorithm is also known as the Porter2 stemming algorithm. It is almost universally accepted as better than the Porter stemmer, even being acknowledged as such by the individual who created the Porter stemmer.
load required libraries
library(tm)
library(Snowball)
create vector
vec<-c("running runner runs","happyness happies")
create corpus from vector
vec<-Corpus(VectorSource(vec))
very important thing is to check class of our corpus and preserve it as we want a standard corpus that R functions understand
class(vec[[1]])
vec[[1]]
<<PlainTextDocument (metadata: 7)>>
running runner runs
this will probably tell you Plain text document
So now we modify our faulty stemDocument function. first we convert our plain text to character and then we split out text, apply stemDocument which works fine now and paste it back together. most importantly we reconvert output to PlainTextDocument given by tm package.
stemDocumentfix <- function(x)
{
PlainTextDocument(paste(stemDocument(unlist(strsplit(as.character(x), " "))),collapse=' '))
}
now we can use standard tm_map on our corpus
vec1 = tm_map(vec, stemDocumentfix)
result is
vec1[[1]]
<<PlainTextDocument (metadata: 7)>>
run runner run
most important thing you need remember is to presever class of documents in corpus always. i hope this is a simplified solution to your problem using function from within the 2 libraries loaded.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With