Snowball Stemmer only stems last word

Tags: r, stemming, tm

I want to stem the documents in a corpus of plain-text documents using the tm package in R. When I apply the SnowballStemmer function to all documents in the corpus, only the last word of each document is stemmed.

library(tm)
library(Snowball)
library(RWeka)
library(rJava)
path <- c("C:/path/to/directory")
corp <- Corpus(DirSource(path),
               readerControl = list(reader = readPlain, language = "en_US",
                                    load = TRUE))
tm_map(corp,SnowballStemmer) #stemDocument has the same problem

I think it is related to the way the documents are read into the corpus. To illustrate this with some simple examples:

> vec<-c("running runner runs","happyness happies")
> stemDocument(vec) 
   [1] "running runner run" "happyness happi" 

> vec2<-c("running","runner","runs","happyness","happies")
> stemDocument(vec2)
   [1] "run"    "runner" "run"    "happy"  "happi"

> corp<-Corpus(VectorSource(vec))
> corp<-tm_map(corp, stemDocument)
> inspect(corp)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
   Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   [[1]]
   run runner run

   [[2]]
   happy happi

> corp2<-Corpus(DirSource(path),readerControl=list(reader=readPlain,language="en_US" ,  load=T))
> corp2<-tm_map(corp2, stemDocument)
> inspect(corp2)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
     Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   $`1.txt`
   running runner runs

   $`2.txt`
   happyness happies


asked Aug 31 '11 21:08

Christian

1 Answer

Load the required libraries:

library(tm)
library(Snowball)

Create a vector:

vec<-c("running runner runs","happyness happies")

Create a corpus from the vector:

vec<-Corpus(VectorSource(vec))

It is important to check the class of the documents in the corpus and to preserve it, because we want a standard corpus that the tm functions understand:

class(vec[[1]])

vec[[1]]
<<PlainTextDocument (metadata: 7)>>
running runner runs

The class call will most likely report PlainTextDocument.

Now we work around the faulty stemDocument function. First we convert the plain-text document to a character string and split it into words. Then we apply stemDocument to the individual words, which works correctly, and paste the results back together. Most importantly, we convert the output back to a PlainTextDocument, the class provided by the tm package.

stemDocumentfix <- function(x)
{
    PlainTextDocument(paste(stemDocument(unlist(strsplit(as.character(x), " "))),collapse=' '))
}
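
The split, stem, and rejoin pattern inside stemDocumentfix can be tried in isolation with base R. The sketch below is only an illustration: toy_stem is a hypothetical stand-in that strips one trailing "s" and is not a real stemmer, and stem_each_word and the sample strings are made up for this example. Like the stemDocument behaviour shown in the question, toy_stem only touches the end of whatever string it is given.

```r
# Hypothetical stand-in for a stemmer (NOT a real one): strips one trailing "s".
# Applied to a whole sentence, it only changes the last word.
toy_stem <- function(x) sub("s$", "", x)

# Split a document on whitespace, stem each word, and paste the words back.
stem_each_word <- function(doc, stem_fun = toy_stem) {
  words <- unlist(strsplit(doc, "[[:space:]]+"))
  paste(stem_fun(words), collapse = " ")
}

docs <- c("cats dogs birds", "runs jumps")

# Whole-string call: only the final word of each document is changed.
whole <- toy_stem(docs)

# Word-by-word call: every word is changed.
per_word <- vapply(docs, stem_each_word, character(1), USE.NAMES = FALSE)

print(whole)
print(per_word)
```

Swapping toy_stem for stemDocument gives the same structure as stemDocumentfix, minus the conversion back to PlainTextDocument.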

Now we can use the standard tm_map on our corpus:

vec1 = tm_map(vec, stemDocumentfix)

The result is:

vec1[[1]]
<<PlainTextDocument (metadata: 7)>>
run runner run

The most important thing to remember is to always preserve the class of the documents in the corpus. I hope this is a simple solution to your problem, using only functions from the two libraries loaded.
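
For readers on a newer tm release, a possible alternative for preserving the document class is content_transformer, which tm provides to wrap a function that works on plain character data. The sketch below is an assumption about how that could look, not the method used above: stem_words, corp, and stemmed_text are illustrative names, and the code only runs when both tm and SnowballC are installed.

```r
# Runs only if both packages are installed; tm's stemDocument relies on SnowballC.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("SnowballC", quietly = TRUE)) {

  # Illustrative helper: stem every word of every line of a document's text.
  stem_words <- function(text) {
    vapply(text, function(line) {
      words <- unlist(strsplit(line, "[[:space:]]+"))
      paste(tm::stemDocument(words), collapse = " ")
    }, character(1), USE.NAMES = FALSE)
  }

  corp <- tm::VCorpus(tm::VectorSource(c("running runner runs",
                                         "happyness happies")))

  # content_transformer() applies stem_words to the text only, so each
  # element of the corpus keeps its document class and metadata.
  corp <- tm::tm_map(corp, tm::content_transformer(stem_words))

  # Collect the stemmed text of each document as one string.
  stemmed_text <- vapply(seq_len(length(corp)), function(i) {
    paste(as.character(corp[[i]]), collapse = " ")
  }, character(1))

  print(class(corp[[1]]))
  print(stemmed_text)
}
```

With this approach there is no need to rebuild the PlainTextDocument by hand, since only the text content is replaced.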


answered Sep 22 '22 07:09

Abhinav Jain
