How to exactly remove the punctuation when using R with tm package

Question

Update:

I think I may have a workaround to solve this problem, just to add one code :dtms = removeSparseTerms(dtm,0.1) It will remove the sparse character in the corpus. But I think this is ONLY a workaround, still wait for experts' answer!

Recently I am learning text mining in R using tm package. And I have a idea to draw a word cloud about the words in my ABAP program in maximum frequency. So I wrote a R program to realize this.

library(tm)
library(SnowballC)
library(wordcloud)

# set path
path = system.file("texts","abapcode",package = "tm")

# make corpus
code = Corpus(DirSource(path),readerControl = list(language = "en"))

# cleanse text
code = tm_map(code,stripWhitespace)
code = tm_map(code,removeWords,stopwords("en"))
code = tm_map(code,removePunctuation)
code = tm_map(code,removeNumbers)

# make DocumentTermMatrix
dtm = DocumentTermMatrix(code)

#freqency 
freq = sort(colSums(as.matrix(dtm)),decreasing = T)

#wordcloud(code,scale = c(5,1),max.words = 50,random.order = F,colors = brewer.pal(8, "Dark2"),rot.per = 0.35,use.r.layout = F)
wordcloud(names(freq),freq,scale = c(5,1),max.words = 50,random.order = F,colors = brewer.pal(8, "Dark2"),rot.per = 0.35,use.r.layout = F)

But in my ABAP code, some variants contain "_" and "-" in the variant name, so if I executed this :

code = tm_map(code,removePunctuation)

The corpus content is not so correct and thus the word cloud is like this: enter image description here

Some words are so weird if remove the "_" or "-".

And then I comment that code and the word cloud is like this: enter image description here

This time the words are correct but some unexpected character popped up, such as my ABAP code commet...

So do we have some methods that can exactly remove the punctuation that we don't want and keep those we want?

Tensibai · Accepted Answer

Posting as answer for the code formatting, but it's an adaption from the documentation of content_transformer found from getTransformtions found in tm_map documentation :

Mainly it's using gsub in a content_transformer to do the same as removePunctuation minus _ and - (the [:punct:] regex class). removePunctuation has an option to keep dashes - but not to keep underscores _.

f <- content_transformer(function(x, pattern) gsub(pattern, "", x))
code <- tm_map(code, f, "[!\"#$%&'*+,./)(:;<=>?@\][\^`{|}~]")

In the character class you have to escape the \, the " and the closing bracket ].

Gaurav · Answer

Ok... so the following works... Convert the corpus into a data frame, remove the unwanted characters and subsequently reconvert to corpus...

dataframe<-data.frame(text=unlist(sapply(code,[, "content")), stringsAsFactors=F) dataframe$text <- gsub("[][!#$%()*,.:;<=>@^_|~.{}]", "", dataframe$text)

code <- corpus(Vectorsource(dataframe$text))

How to exactly remove the punctuation when using R with tm package

Tags:

r

punctuation

text-mining

customization

tm

Lisen

2 Answers

Tensibai

Gaurav

Recent Activity

Donate For Us

How to exactly remove the punctuation when using R with tm package

Tags:

r

punctuation

text-mining

customization

tm

Lisen

2 Answers

Tensibai

Gaurav

Related questions

Recent Activity

Donate For Us