Quotes and hyphens not removed by tm package functions while cleaning corpus

Tags: r, text-mining, tm

I'm trying to clean a corpus using the typical tm steps, as in the code below:

library(tm)

# path: directory containing the text files
docs <- Corpus(DirSource(path))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, stopwords('en'))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
dtm <- DocumentTermMatrix(docs)

Yet when I inspect the matrix, a few terms still come with quotes or leading hyphens, such as: "we" "company" "code guidelines" -known -accelerated

The words themselves appear to be inside the quotes, but running removePunctuation again has no effect. Some words also have bullets in front of them that I can't remove either.

Any help would be greatly appreciated.

anonymous asked Jun 23 '15 05:06


1 Answer

removePunctuation uses gsub('[[:punct:]]','',x), i.e. it removes the ASCII punctuation characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. Whether [[:punct:]] also matches non-ASCII punctuation depends on the locale and platform. To remove other symbols, such as typographic quotes or bullets, declare your own transformation function:

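# Square brackets make this a character class, so each listed character is removed wherever it occurs
removeSpecialChars <- function(x) gsub("[“”•]", "", x)
# Custom functions need content_transformer() so that tm_map keeps the corpus structure
docs <- tm_map(docs, content_transformer(removeSpecialChars))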
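
If the script's file encoding is a concern, the same idea can be written with Unicode escapes, and the offending characters can be identified by their code points first. This is only a sketch in base R: the sample terms, the name strip_typographic, and the list of code points (curly quotes, bullet, en and em dash) are assumptions about what the corpus contains, not something taken from the question's data.

```r
# Hypothetical sample terms resembling the ones in the question
terms <- c("\u201Cwe\u201D", "\u2022 company", "\u2013known")

# Print the code point of each term's first character to see what needs removing
sapply(terms,
       function(t) sprintf("U+%04X", utf8ToInt(substr(t, 1, 1))),
       USE.NAMES = FALSE)

# Character class (assumed list): curly double and single quotes, bullet, en dash, em dash
strip_typographic <- function(x) {
  gsub("[\u201C\u201D\u2018\u2019\u2022\u2013\u2014]", "", x)
}

cleaned <- strip_typographic(terms)
```

With tm, such a function would be applied as tm_map(docs, content_transformer(strip_typographic)) before building the document-term matrix.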

Or you can go further and remove everything that is not an ASCII letter, a digit, or a space:

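removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)
# Custom functions need content_transformer() so that tm_map keeps the corpus structure
docs <- tm_map(docs, content_transformer(removeSpecialChars))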
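
One side effect of deleting characters outright is that words joined by a symbol get glued together (for example, "item-one" becomes "itemone"). A possible variant, sketched here in base R with a made-up sample string, replaces each unwanted character with a space and then collapses the extra whitespace. Note that the ASCII-only class also drops accented letters, which may or may not be acceptable for a given corpus.

```r
# Hypothetical sample text with curly quotes, an en dash, a bullet, and a hyphen
x <- "\u201Ccode guidelines\u201D \u2013known \u2022 item-one"

# Replace every character that is not an ASCII letter, digit, or space with a space
spaced <- gsub("[^a-zA-Z0-9 ]", " ", x)

# Collapse runs of spaces into one and trim both ends
cleaned <- trimws(gsub(" +", " ", spaced))
```

In a tm pipeline, the first gsub could be wrapped in content_transformer() and followed by the existing stripWhitespace step, which already handles the collapsing.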
cyberj0g answered Sep 25 '22 19:09