Quotes and hyphens not removed by tm package functions while cleaning corpus

Question

I'm trying to clean a corpus using the typical steps, as in the code below:

library(tm)

docs <- Corpus(DirSource(path))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, stopwords('en'))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
dtm <- DocumentTermMatrix(docs)

Yet when I inspect the matrix, a few terms still carry quotes or leading hyphens, such as: "we" "company" "code guidelines" -known -accelerated

The words themselves seem to sit inside the quotes, but running removePunctuation again doesn't remove them. There are also some words with bullets in front of them that I can't remove either.

Any help would be greatly appreciated.

cyberj0g · Accepted Answer

removePunctuation uses gsub('[[:punct:]]', '', x), i.e. it removes the ASCII punctuation symbols !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. To remove other symbols, such as typographic quotes or bullet signs, declare your own transformation function:

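# The square brackets make this a character class: any one of “, ” or • is removed
removeSpecialChars <- function(x) gsub("[“”•]", "", x)
docs <- tm_map(docs, content_transformer(removeSpecialChars))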
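
To check the pattern before applying it to the whole corpus, it can help to try it on a few plain strings first. The sketch below uses base R only. The helper name remove_special_chars and the sample terms are made up for illustration, and the set of characters (curly quotes, bullet, en and em dash) is an assumption about what the source files contain.

```r
# Hypothetical helper: strip typographic characters that fall outside
# the ASCII punctuation set.
remove_special_chars <- function(x) {
  # \u201C \u201D = curly double quotes, \u2018 \u2019 = curly single quotes,
  # \u2022 = bullet, \u2013 \u2014 = en and em dash
  gsub("[\u201C\u201D\u2018\u2019\u2022\u2013\u2014]", "", x)
}

# Made-up sample terms resembling the ones in the question
terms <- c("\u201Cwe\u201D", "\u201Ccode guidelines\u201D",
           "\u2013known", "\u2022 accelerated")

# trimws() drops the space left behind by the removed bullet
cleaned <- trimws(remove_special_chars(terms))
print(cleaned)
```

If the hyphens in the matrix are really en dashes copied from a word processor, that would explain why the ASCII-only punctuation step left them in place.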

Or you can go further and remove everything that is not an alphanumeric character or a space (note that this also drops accented letters):

removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)
docs <- tm_map(docs, content_transformer(removeSpecialChars))
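
The whitelist is blunt, so here is a small base R comparison with a gentler alternative that is not part of the original answer: matching the Unicode punctuation category \p{P} with perl = TRUE. The function names and the sample string are hypothetical, and the sketch assumes the text is UTF-8.

```r
# Whitelist approach: keep only ASCII letters, digits and spaces
keep_ascii_alnum <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)

# Alternative (an assumption, not the answer's method): drop only characters
# in Unicode general category P (punctuation), leaving accented letters alone
drop_unicode_punct <- function(x) gsub("\\p{P}", "", x, perl = TRUE)

# Collapse runs of spaces left behind by removed symbols
squash_spaces <- function(x) gsub(" +", " ", x)

# Made-up sample: curly quotes around an accented word, a bullet, an en dash
sample_text <- "\u201Ccaf\u00E9\u201D \u2022 \u2013known"

ascii_only <- squash_spaces(keep_ascii_alnum(sample_text))
no_punct <- squash_spaces(drop_unicode_punct(sample_text))

print(ascii_only)
print(no_punct)
```

The whitelist loses the accented letter while the category-based version keeps it. Recent tm releases, as far as I know, accept removePunctuation(x, ucp = TRUE), which takes the same Unicode-category approach; check ?removePunctuation for your installed version before relying on it.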

Tags: r, text-mining, tm
