I am using R to tokenize a set of texts; after tokenization I end up with a char vector in which punctuation signs, apostrophes and hyphens are preserved. For instance, I have this original text
txt <- "this ain't a Hewlett-Packard box - it's an Apple box, a very nice one!"
After tokenization (which I perform using scan_tokenizer from package tm) I get the following char vector:
> vec1
[1] "this" "ain't" "a" "Hewlett-Packard"
[5] "box" "-" "it's" "an"
[9] "Apple" "box," "a" "very"
[13] "nice" "one!"
Now in order to get rid of the punctuation marks I do the following
vec2 <- gsub("[^[:alnum:][:space:]']", "", vec1)
That is, I replace everything that is not an alphanumeric character, a space or an apostrophe with ""; however, this is the result:
> vec2
[1] "this" "ain't" "a" "HewlettPackard" "box"
[6] "" "it's" "an" "Apple" "box"
[11] "a" "very" "nice" "one"
I want to preserve hyphenated words such as "Hewlett-Packard", while getting rid of lone hyphens. Basically, I need a regex that excludes hyphenated words of the form \w-\w from the substitution in the gsub expression for vec2.
Any suggestions are welcome.
If you just want to remove "pure hyphens", then use the pattern '^-$' (outside a bracket expression the hyphen is not a regex meta-character, so it needs no escaping):
vec2 <- vec1[!grepl( '^-$' , vec1) ]
If you wanted to remove "naked punctuation of all sorts" it might be:
vec2 <- vec1[!grepl( '^[[:punct:]]$' , vec1) ]
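If you would rather keep the gsub step from the question and still preserve hyphenated words, one possible sketch is below. It is not from the answer above: it assumes that hyphens should survive only when they sit inside a token, and it rebuilds vec1 by hand from the output shown in the question instead of calling the tm tokenizer.

```r
# Token vector copied from the question (typed by hand, not produced by tm)
vec1 <- c("this", "ain't", "a", "Hewlett-Packard", "box", "-", "it's", "an",
          "Apple", "box,", "a", "very", "nice", "one!")

# Step 1: strip everything except alphanumerics, spaces, apostrophes and hyphens.
# The hyphen comes last in the bracket expression so it is taken literally.
vec2 <- gsub("[^[:alnum:][:space:]'-]", "", vec1)

# Step 2: remove hyphens at the start or end of a token,
# so only hyphens between other characters are kept.
vec2 <- gsub("^-+|-+$", "", vec2)

# Step 3: drop tokens that became empty (for example the lone "-")
vec2 <- vec2[nzchar(vec2)]
vec2
```

With this input, "Hewlett-Packard" is kept intact, "box," and "one!" lose their trailing punctuation, and the lone "-" token is dropped.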