
Snowball Stemmer only stems last word

Tags: r, stemming, tm

I want to stem the documents in a Corpus of plain text documents using the tm package in R. When I apply the SnowballStemmer function to all documents of the corpus, only the last word of each document is stemmed.

library(tm)
library(Snowball)
library(RWeka)
library(rJava)
path <- c("C:/path/to/directory")
corp <- Corpus(DirSource(path),
               readerControl = list(reader = readPlain, language = "en_US",
                                    load = TRUE))
tm_map(corp,SnowballStemmer) #stemDocument has the same problem

I think it is related to the way the documents are read into the corpus. To illustrate this with some simple examples:

> vec<-c("running runner runs","happyness happies")
> stemDocument(vec) 
   [1] "running runner run" "happyness happi" 

> vec2<-c("running","runner","runs","happyness","happies")
> stemDocument(vec2)
   [1] "run"    "runner" "run"    "happy"  "happi"

> corp<-Corpus(VectorSource(vec))
> corp<-tm_map(corp, stemDocument)
> inspect(corp)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
   Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   [[1]]
   run runner run

   [[2]]
   happy happi

> corp2<-Corpus(DirSource(path),readerControl=list(reader=readPlain,language="en_US" ,  load=T))
> corp2<-tm_map(corp2, stemDocument)
> inspect(corp2)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
     Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   $`1.txt`
   running runner runs

   $`2.txt`
   happyness happies
Christian asked Aug 31 '11 21:08


1 Answer

Load the required libraries:

library(tm)
library(Snowball)

Create the vector:

vec<-c("running runner runs","happyness happies")

Create a corpus from the vector:

vec<-Corpus(VectorSource(vec))

It is very important to check the class of the documents in the corpus and to preserve it, because we want a standard corpus that the tm functions understand:

class(vec[[1]])

vec[[1]]
<<PlainTextDocument (metadata: 7)>>
running runner runs

This should report that the document is a PlainTextDocument.

Now we work around the faulty stemDocument function. First we convert the plain text document to a character string and split it into words on spaces. Then we apply stemDocument to the individual words (it works correctly on single words) and paste the stems back together. Most importantly, we convert the result back to a PlainTextDocument, the class provided by the tm package.

stemDocumentfix <- function(x)
{
    PlainTextDocument(paste(stemDocument(unlist(strsplit(as.character(x), " "))),collapse=' '))
}
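The split, stem, and paste steps can also be sketched without tm, on plain character vectors. This is a minimal illustration and not the answer's exact code: stem_each_word and strip_s are hypothetical names introduced here, strip_s is a toy stand-in for a real stemmer, and the last lines assume the SnowballC package is installed.

```r
# Stem every whitespace-separated word of each string in `text`.
# `stemmer` is any function that maps a character vector of words to stems.
stem_each_word <- function(text, stemmer) {
  vapply(text, function(doc) {
    words <- unlist(strsplit(doc, "[[:space:]]+"))
    words <- words[nzchar(words)]          # drop empty tokens from leading spaces
    paste(stemmer(words), collapse = " ")  # glue the stems back into one string
  }, character(1), USE.NAMES = FALSE)
}

# Toy stemmer for illustration only (hypothetical): strips one trailing "s".
strip_s <- function(words) sub("s$", "", words)

print(stem_each_word(c("running runner runs", "cats dogs"), strip_s))

# With a real stemmer, assuming SnowballC is available:
if (requireNamespace("SnowballC", quietly = TRUE)) {
  print(stem_each_word("running runner runs", SnowballC::wordStem))
}
```

Passing the stemmer as an argument keeps the word-splitting logic separate from the choice of stemming algorithm, so the same helper can be checked with a trivial function before plugging in a real one.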

Now we can use the standard tm_map on our corpus:

vec1 = tm_map(vec, stemDocumentfix)

The result is:

vec1[[1]]
<<PlainTextDocument (metadata: 7)>>
run runner run

The most important thing to remember is to always preserve the class of the documents in the corpus. I hope this is a simple solution to your problem, using only functions from the two libraries loaded.
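As a hedged aside on preserving the document class: later tm releases (0.6 and up, which is an assumption about your installed version) provide content_transformer(), which wraps a plain character function so that tm_map returns documents of the same class. The sketch below assumes tm and SnowballC are installed; stem_words is a hypothetical helper name, not part of either package.

```r
# Split a string on spaces, stem each word, and paste the words back together.
# `stemmer` defaults to identity so the helper runs without extra packages.
stem_words <- function(text, stemmer = identity) {
  words <- unlist(strsplit(text, " ", fixed = TRUE))
  paste(stemmer(words), collapse = " ")
}

if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("SnowballC", quietly = TRUE)) {
  corp <- tm::VCorpus(tm::VectorSource(c("running runner runs",
                                         "happyness happies")))
  # content_transformer() applies the function to the document text and
  # returns the document itself, so its class is kept without a manual
  # PlainTextDocument() call.
  stemmed <- tm::tm_map(
    corp,
    tm::content_transformer(function(x) stem_words(x, SnowballC::wordStem))
  )
  print(class(stemmed[[1]]))
  print(as.character(stemmed[[1]]))
}
```

This does the same job as stemDocumentfix above, with the class handling delegated to tm instead of being rebuilt by hand.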

Abhinav Jain answered Sep 22 '22 07:09