
when [:punct:] is too much [duplicate]

Tags:

regex

r

I'm cleaning text strings in R. I want to remove all the punctuation except apostrophes and hyphens. This means I can't use the [:punct:] character class (unless there's a way of saying [:punct:] but not '-).

! " # $ % & ( ) * + , . / : ; < = > ? @ [ \ ] ^ _ { | } ~. and backtick must come out.

For most of the above, escaping is not an issue. But for square brackets, I'm really having issues. Here's what I've tried:

gsub('[abc]', 'L', 'abcdef') #expected behaviour, shown as sanity check
# [1] "LLLdef"

gsub('[[]]', 'B', 'it[]') #only 1 substitution, ie [] treated as a single character
# [1] "itB"

gsub('[\[\]]', 'B', 'it[]') #single escape, errors as expected

Error: '\[' is an unrecognized escape in character string starting "'[\["

gsub('[\\[\\]]', 'B', 'it[]') #double escape, single substitution
# [1] "itB"

gsub('[\\]\\[]', 'B', 'it[]') #double escape, reversed order, NO substitution
# [1] "it[]"

I'd prefer not to use fixed=TRUE with gsub since that will prevent me from using a character class. So, how do I include square brackets in a regex character class?

ETA additional trials:

gsub('[[\\]]', 'B', 'it[]') #double escape on closing ] only, single substitution
# [1] "itB"

gsub('[[\]]', 'B', 'it[]') #single escape on closing ] only, expected error

Error: '\]' is an unrecognized escape in character string starting "'[[\]"

ETA: the single substitution was caused by not setting perl=T in my gsub calls. ie:

gsub('[[\\]]', 'B', 'it[]', perl=T)
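
To make the difference concrete, here is a minimal sketch that runs the same pattern through both engines. The variable names are only illustrative; the "itB" result is the one reported in the trials above, and the perl = TRUE result is what the PCRE engine should produce once the backslash is treated as an escape.

```r
x <- 'it[]'

# Default engine: the backslash does not escape ']' inside the bracket
# expression, so the class closes at the first ']' and the second ']'
# is matched as a literal. The pattern therefore matches "[]" as a unit.
default_result <- gsub('[[\\]]', 'B', x)

# PCRE engine: the backslash escapes ']', so the class contains both
# '[' and ']' and each bracket is replaced separately.
perl_result <- gsub('[[\\]]', 'B', x, perl = TRUE)

print(default_result)  # "itB"
print(perl_result)     # "itBB"
```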
dnagirl asked May 06 '13 13:05


2 Answers

You can use [:punct:] if you combine it with a negative lookahead:

(?!['-])[[:punct:]]

This way a [:punct:] character is matched only if it is not in ['-]. The negative lookahead assertion (?!['-]) enforces this condition: it fails when the next character is a ' or a -, and then the whole expression fails at that position. In R, lookaheads are only supported by the PCRE engine, so gsub needs perl=TRUE.
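
As a rough sketch of how this might look in R (the sample string and variable names are made up for illustration), the lookahead pattern can be passed to gsub with perl = TRUE:

```r
# Hypothetical sample text containing apostrophes, hyphens and other punctuation.
x <- "Hello, world! It's a so-called [test] (really)."

# Remove every punctuation character unless it is an apostrophe or a hyphen.
# perl = TRUE is needed because the default engine has no lookahead support.
cleaned <- gsub("(?!['-])[[:punct:]]", "", x, perl = TRUE)

print(cleaned)  # "Hello world It's a so-called test really"
```

Because the exclusion lives in the lookahead, further characters can be kept by adding them to ['-] without touching the [:punct:] part.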

stema answered Nov 18 '22 08:11


Inside a character class you only need to escape the closing square bracket:

Try using '[[\\]]' or '[[\]]' (I am not sure about escaping the backslash as I don't know R.)

See this example.
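
For R specifically, a small sketch under two assumptions: the backslash has to be doubled inside an R string literal, and the escape is only honoured with perl = TRUE. With the default engine, the usual POSIX convention of placing ] first in the class appears to be the way to get the same result. The variable names are illustrative.

```r
x <- 'it[]'

# PCRE: only the closing bracket needs escaping inside the class.
# In R source the backslash itself must be written as "\\".
escaped <- gsub('[[\\]]', 'B', x, perl = TRUE)

# Default engine: no escaping; a ']' placed first in the class is literal.
bracket_first <- gsub('[][]', 'B', x)

print(escaped)        # "itBB"
print(bracket_first)  # "itBB"
```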

Daniel Hilgarth answered Nov 18 '22 08:11