In the source code of the tm text-mining R package, in the file transform.R, the removePunctuation() function is currently defined as:
function(x, preserve_intra_word_dashes = FALSE)
{
    if (!preserve_intra_word_dashes)
        gsub("[[:punct:]]+", "", x)
    else {
        # Assume there are no ASCII 1 characters.
        x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
        x <- gsub("[[:punct:]]+", "", x)
        gsub("\1", "-", x, fixed = TRUE)
    }
}
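For reference, here is roughly how the built-in behaves on a small example (output shown as comments):

library(tm)
removePunctuation("state-of-the-art, really!")
## [1] "stateoftheart really"
removePunctuation("state-of-the-art, really!", preserve_intra_word_dashes = TRUE)
## [1] "state-of-the-art really"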
I need to parse and mine some abstracts from a science conference (fetched from their website as UTF-8). The abstracts contain some Unicode characters that need to be removed, particularly at word boundaries. There are the usual ASCII punctuation characters, but also a few Unicode dashes, Unicode quotes, math symbols, and so on.
There are also URLs in the text, and there the intra-word punctuation characters need to be preserved. tm's built-in removePunctuation() function is too radical for this.
So I need a custom removePunctuation() function that does the removal according to my requirements.
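To make that concrete, here is roughly the kind of transformation I am after (a made-up fragment, not one of the real abstracts):

x <- "A “significant” effect (p<0.001) – see http://example.org/abs?id=42"
# desired result, approximately:
# "A significant effect p<0.001 see http://example.org/abs?id=42"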
My custom Unicode function looks like this now, but it does not work as expected. I am using R only rarely, so getting things done in R takes some time, even for the simplest tasks.
My function:
corpus <- tm_map(corpus, rmPunc = function(x){
    # lookbehinds
    # need to be careful to specify fixed-width conditions
    # so that it can be used in lookbehind
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{5})([[:alnum:]])'," \\2", x, perl=TRUE) ;
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{4})([[:alnum:]])'," \\2", x, perl=TRUE) ;
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{3})([[:alnum:]])'," \\2", x, perl=TRUE) ;
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>]{2})([[:alnum:]])'," \\2", x, perl=TRUE) ;
    x <- gsub('(.*?)(?<=^[[:punct:]’“”:±</>])([[:alnum:]])'," \\2", x, perl=TRUE) ;
    # lookaheads (can use variable-width conditions)
    x <- gsub('(.*?)(?=[[:alnum:]])([[:punct:]’“”:±]+)$',"\1 ", x, perl=TRUE) ;
    # remove all strings that consist *only* of punct chars
    gsub('^[[:punct:]’“”:±</>]+$',"", x, perl=TRUE) ;
})
It does not work as expected. I think it doesn't do anything at all. The punctuation is still inside the term-document matrix, see:
head(Terms(tdm), n=30)
[1] "<></>" "---"
[3] "--," ":</>"
[5] ":()" "/)."
[7] "/++" "/++,"
[9] "..," "..."
[11] "...," "..)"
[13] "“”," "(|)"
[15] "(/)" "(.."
[17] "(..," "()=(|=)."
[19] "()," "()."
[21] "(&)" "++,"
[23] "(0°" "0.001),"
[25] "0.003" "=0.005)"
[27] "0.006" "=0.007)"
[29] "000km" "0.01)"
...
So my question is:
Are \P{ASCII} or \P{PUNCT} supported in R's Perl-compatible regular expressions? I think they aren't (by default), according to the PCRE documentation: "Only the support for various Unicode properties with \p is incomplete, though the most important ones are supported."

As much as I like Susana's answer, it breaks the corpus in newer versions of tm (the documents are no longer PlainTextDocuments and the meta data is destroyed).
You will get a list and the following error:
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
Using
tm_map(your_corpus, PlainTextDocument)
will give you back your corpus, but with broken $meta (in particular, document ids will be missing).
Solution
Use content_transformer:
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
your_corpus <- tm_map(your_corpus, toSpace, "„")
Source: Hands-On Data Science with R, Text Mining, [email protected] http://onepager.togaware.com/
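For example, to map a few of the characters mentioned in the question to spaces (the character list here is only a guess; adjust it to whatever actually occurs in your abstracts), and the corpus stays a corpus of PlainTextDocuments, so meta() keeps working:

your_corpus <- tm_map(your_corpus, toSpace, "–")   # en dash
your_corpus <- tm_map(your_corpus, toSpace, "“")
your_corpus <- tm_map(your_corpus, toSpace, "”")
your_corpus <- tm_map(your_corpus, toSpace, "±")
meta(your_corpus[[1]])   # document meta data is still intact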
This function removes everything that is not alphanumeric or whitespace (e.g. UTF-8 emoticons):
removeNonAlnum <- function(x) {
    # drop every character that is neither alphanumeric nor whitespace
    gsub("[^[:alnum:][:space:]]", "", x)
}
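To apply it to a corpus without running into the meta problem described above, wrap it in content_transformer as well (assuming your_corpus as before):

your_corpus <- tm_map(your_corpus, content_transformer(removeNonAlnum))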
I had the same problem: the custom function was not working, but the first line below (defining the S3 generic) has to be added.
Regards
Susana
# define an S3 generic so tm_map can dispatch on the document class
replaceExpressions <- function(x) UseMethod("replaceExpressions", x)
replaceExpressions.PlainTextDocument <- replaceExpressions.character <- function(x) {
    # replace periods, commas and colons with spaces (fixed = TRUE, so no regex)
    x <- gsub(".", " ", x, ignore.case = FALSE, fixed = TRUE)
    x <- gsub(",", " ", x, ignore.case = FALSE, fixed = TRUE)
    x <- gsub(":", " ", x, ignore.case = FALSE, fixed = TRUE)
    return(x)
}
notes_pre_clean <- tm_map(notes, replaceExpressions, useMeta = FALSE)
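On a plain character string the generic falls through to replaceExpressions.character, so you can test it directly (output shown as a comment):

replaceExpressions("a.b,c:d")
## [1] "a b c d"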