
Regex expression to exclude hyphenated words in R

Tags: regex, r

I am using R to tokenize a set of texts; after tokenization I end up with a character vector in which punctuation marks, apostrophes and hyphens are preserved. For instance, I have this original text:

txt <- "this ain't a Hewlett-Packard box - it's an Apple box, a very nice one!"

After the tokenization (which I perform using scan_tokenizer from package tm) I get the following char vector

> vec1
 [1] "this"            "ain't"           "a"               "Hewlett-Packard"
 [5] "box"             "-"               "it's"            "an"             
 [9] "Apple"           "box,"            "a"               "very"           
[13] "nice"            "one!"           

Now in order to get rid of the punctuation marks I do the following

vec2 <- gsub("[^[:alnum:][:space:]']", "", vec1)

That is, I replace everything that is not an alphanumeric character, a space or an apostrophe with ""; however, this is the result:

> vec2
 [1] "this"           "ain't"          "a"              "HewlettPackard" "box"           
 [6] ""               "it's"           "an"             "Apple"          "box"           
[11] "a"              "very"           "nice"           "one"    

I want to preserve hyphenated words such as "Hewlett-Packard" while getting rid of lone hyphens. Basically, I need a regex that excludes hyphenated words of the form \w-\w from the gsub expression used for vec2.

Any suggestions are welcome.

Jose Manuel Albornoz asked Dec 05 '22 21:12


1 Answer

If you just want to remove "pure hyphens", use the pattern '^-$' (the hyphen is not a regex metacharacter outside a bracket expression, so it needs no escaping).

vec2 <- vec1[!grepl( '^-$' , vec1) ]
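
As a quick check, the filter can be run against the tokens from the question. This is only a sketch: vec1 is typed in by hand here rather than produced by scan_tokenizer, so that the tm package is not needed.

```r
# Tokens copied from the question (entered by hand, so tm is not required)
vec1 <- c("this", "ain't", "a", "Hewlett-Packard", "box", "-", "it's",
          "an", "Apple", "box,", "a", "very", "nice", "one!")

# grepl() flags elements that consist of exactly one hyphen;
# the negation keeps everything else
vec2 <- vec1[!grepl("^-$", vec1)]

print(vec2)
```

Because the pattern is anchored at both ends, "Hewlett-Packard" is left alone. Punctuation attached to a word, as in "box," and "one!", is also left in place by this step.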

If you wanted to remove "naked punctuation of all sorts" it might be:

vec2 <- vec1[!grepl( '^[[:punct:]]$' , vec1) ]
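
Note that '^[[:punct:]]$' only matches tokens that are exactly one punctuation character, and neither filter touches punctuation attached to a word. One possible way to get the output the question asks for is sketched below, assuming that apostrophes and word-internal hyphens should both be kept. The helper name clean_tokens is made up for illustration, and vec1 is again entered by hand.

```r
# clean_tokens() is a hypothetical helper name, not part of tm or base R
clean_tokens <- function(tokens) {
  # 1. Drop tokens made up entirely of punctuation ("-", "--", "...")
  kept <- tokens[!grepl("^[[:punct:]]+$", tokens)]
  # 2. Strip remaining punctuation but keep apostrophes and hyphens.
  #    The hyphen comes last in the bracket expression, so it is literal.
  cleaned <- gsub("[^[:alnum:][:space:]'-]", "", kept)
  # 3. Trim hyphens left dangling at either end of a token
  gsub("^-+|-+$", "", cleaned)
}

vec1 <- c("this", "ain't", "a", "Hewlett-Packard", "box", "-", "it's",
          "an", "Apple", "box,", "a", "very", "nice", "one!")

vec2 <- clean_tokens(vec1)
print(vec2)
```

With the example tokens this keeps "Hewlett-Packard", "ain't" and "it's" intact, turns "box," and "one!" into "box" and "one", and drops the lone "-" instead of leaving an empty string behind.
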
IRTFM answered Dec 19 '22 16:12
