I'm trying to clean the corpus and I've used the typical steps, like the code below:
docs<-Corpus(DirSource(path))
docs<-tm_map(docs,content_transformer(tolower))
docs<-tm_map(docs,content_transformer(removeNumbers))
docs<-tm_map(docs,content_transformer(removePunctuation))
docs<-tm_map(docs,removeWords,stopwords('en'))
docs<-tm_map(docs,stripWhitespace)
docs<-tm_map(docs,stemDocument)
dtm<-DocumentTermMatrix(docs)
Yet when I inspect the matrix, there are a few words that still come with quotes or other symbols, such as: "we" "company" "code guidelines" -known -accelerated
The words themselves seem to be inside the quotes, but running removePunctuation again doesn't remove them. There are also some words with bullets in front of them that I can't remove either.
Any help would be greatly appreciated.
removePunctuation uses gsub('[[:punct:]]','',x), i.e. it removes the symbols !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. To remove other symbols, like typographic quotes or bullet signs (or any other), declare your own transformation function:
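# The square brackets make a character class, so each symbol is matched on its own
removeSpecialChars <- function(x) gsub("[“”•]","",x)
docs <- tm_map(docs, content_transformer(removeSpecialChars))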
Or you can go further and remove everything that is not an ASCII letter, a digit or a space:
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
docs <- tm_map(docs, content_transformer(removeSpecialChars))
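To see what the two patterns do before running them over the whole corpus, you can try them on a plain character vector first. This is only a minimal sketch in base R: the sample strings and the name keepAlnum are made up for illustration, and the Unicode escapes stand for the curly quotes and the bullet.

```r
# Hypothetical sample terms: \u201c and \u201d are curly quotes, \u2022 is a bullet
x <- c("\u201cwe\u201d", "\u2022 company", "code-guidelines 2020")

# Variant 1: remove only the listed symbols (character class, one match per symbol)
removeSpecialChars <- function(x) gsub("[\u201c\u201d\u2022]", "", x)
cleaned <- removeSpecialChars(x)

# Variant 2 (keepAlnum is a hypothetical name): keep ASCII letters, digits and spaces
keepAlnum <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)
strict <- keepAlnum(x)

print(cleaned)
print(strict)
```

The first variant leaves the hyphen alone, while the second one also drops it (and would drop accented letters too), so pick whichever fits your text. The space left behind by the bullet is cleaned up later by stripWhitespace or by the tokenizer.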