Removing stopwords from a user-defined corpus in R

I have a set of documents:

documents = c("She had toast for breakfast",
 "The coffee this morning was excellent", 
 "For lunch let's all have pancakes", 
 "Later in the day, there will be more talks", 
 "The talks on the first day were great", 
 "The second day should have good presentations too")

In this set of documents, I would like to remove the stopwords. I have already removed punctuation and converted to lower case, using:

documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation

First I convert to a Corpus object:

library(tm)
documents <- Corpus(VectorSource(documents))

Then I try to remove the stopwords:

documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords

But this last line results in the following error:

THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.

This has already been asked here, but no answer was given. What does this error mean?

EDIT

Yes, I am using the tm package.

Here is the output of sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

StatsSorceress asked Dec 08 '22 22:12


2 Answers

When I run into tm problems, I often end up editing the original character vector directly instead.

For removing words it's a little awkward, but you can paste together a regex from tm's stopword list.

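library(tm)  # provides stopwords()
# run this on the plain character vector, before any Corpus() conversion
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
documents = stringr::str_replace_all(documents, stopwords_regex, '')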

> documents
[1] "     toast  breakfast"             " coffee  morning  excellent"      
[3] " lunch lets   pancakes"            "later   day  will   talks"        
[5] " talks   first day  great"         " second day   good presentations "
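
The output above still has runs of spaces where the stopwords used to be. Here is a minimal base R sketch of the same regex idea with a whitespace cleanup step added. The stop_words vector is a small hypothetical stand-in for stopwords('en'), chosen only so the example runs without tm or stringr:

```r
# Hypothetical subset of a stopword list (an assumption, not tm's actual list)
stop_words <- c("she", "had", "for", "the", "this", "was")

docs <- c("she had toast for breakfast",
          "the coffee this morning was excellent")

# One alternation wrapped in word boundaries: \b(she|had|...)\b
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")

# Remove the stopwords; this leaves extra spaces behind
removed <- gsub(pattern, "", docs, perl = TRUE)

# Collapse runs of spaces, then trim leading and trailing spaces
cleaned <- gsub("^ +| +$", "", gsub(" +", " ", removed))

print(cleaned)
```

If you stay inside tm instead, stripWhitespace applied through tm_map should collapse the repeated spaces in a similar way.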
Mhairi McNeill answered Feb 02 '23 20:02


Try doing all of the preprocessing with tm_map on the Corpus itself, rather than on the raw character vector before conversion. That works in my case:

> documents = c("She had toast for breakfast",
+  "The coffee this morning was excellent", 
+  "For lunch let's all have pancakes", 
+  "Later in the day, there will be more talks", 
+  "The talks on the first day were great", 
+  "The second day should have good presentations too")
> library(tm)
Loading required package: NLP
> documents <- Corpus(VectorSource(documents))
> documents = tm_map(documents, content_transformer(tolower))
> documents = tm_map(documents, removePunctuation)
> documents = tm_map(documents, removeWords, stopwords("english"))
> documents
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 6

This yields

> documents[[1]]$content
[1] "  toast  breakfast"
> documents[[2]]$content
[1] " coffee  morning  excellent"
> documents[[3]]$content
[1] " lunch lets   pancakes"
> documents[[4]]$content
[1] "later   day  will   talks"
> documents[[5]]$content
[1] " talks   first day  great"
> documents[[6]]$content
[1] " second day   good presentations "
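
Note that "lets" survives in the third document. That looks like a side effect of stripping punctuation first: the apostrophe is gone before the stopword step runs, so a contraction entry such as "let's" can no longer match. I believe "let's" is in tm's English list, but treat that as an assumption. A small base R sketch of the effect, using a hypothetical four-word stopword list instead of tm's:

```r
# Hypothetical stopword subset (an assumption, not tm's actual list)
stop_words <- c("for", "all", "have", "let's")
doc <- "for lunch let's all have pancakes"

pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")

# Helper: collapse runs of spaces and trim both ends
squish <- function(x) gsub("^ +| +$", "", gsub(" +", " ", x))

# Order 1: punctuation first, then stopwords ("let's" becomes "lets" and is kept)
punct_first <- squish(gsub(pattern, "", gsub("[[:punct:]]", "", doc), perl = TRUE))

# Order 2: stopwords first, then punctuation ("let's" is matched and removed)
stop_first <- squish(gsub("[[:punct:]]", "", gsub(pattern, "", doc, perl = TRUE)))

print(punct_first)
print(stop_first)
```

If that matters for your data, swapping the removeWords and removePunctuation steps in the tm_map chain may be worth trying.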
Ely answered Feb 02 '23 22:02