
Replacing punctuation except intra-word dashes with a space

Tags: python, regex, r

There is already a close answer for R, gsub("[^[:alnum:]['-]", " ", my_string), but the same pattern does not work in Python:

import re
my_string = 'compactified on a calabi-yau threefold @ ,.'
re.sub("[^[:alnum:]['-]", " ", my_string)

gives 'compactified on a calab yau threefold @ ,.'

So not only does it remove the intra-word dash, it also removes the last letter of the word preceding the dash, and it does not remove the punctuation.

Expected result (string without any punctuation except the intra-word dash): 'compactified on a calabi-yau threefold'

Antoine asked Feb 24 '16 21:02


1 Answer

R uses the TRE (POSIX) or PCRE regex engine, depending on the perl option (or the function used). Python's re library implements its own Perl-like flavor with a smaller feature set. In particular, Python does not support POSIX character classes such as [:alnum:], which matches alpha (letters) and num (digits) characters.

In Python, [:alnum:] can be replaced with [^\W_] (or, for ASCII only, [a-zA-Z0-9]), and the negated [^[:alnum:]] with [\W_] (ASCII-only version: [^a-zA-Z0-9]).
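To make the substitutions concrete, here is a small sketch comparing the three classes on a made-up sample string (the string and variable names are only for illustration):

```python
import re

# Illustrative sample: "café_42 naïve-9!" written with escapes for portability.
text = "caf\u00e9_42 na\u00efve-9!"

# Unicode-aware alphanumerics: word characters minus the underscore.
unicode_alnum = "".join(re.findall(r"[^\W_]", text))

# ASCII-only alphanumerics: accented letters are skipped.
ascii_alnum = "".join(re.findall(r"[a-zA-Z0-9]", text))

# Negated form: everything that is not a letter or digit, underscore included.
non_alnum = re.findall(r"[\W_]", text)

print(unicode_alnum)
print(ascii_alnum)
print(non_alnum)
```

The Unicode-aware class keeps the accented letters, while the ASCII class drops them, so pick whichever matches your data.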

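In R, [^[:alnum:]['-] matches any single character other than an alphanumeric (letter or digit), [, ', or -. It therefore keeps every dash and apostrophe, not only the intra-word ones, which means the R answer you refer to is not actually correct. Python reads the same pattern differently: as the negated set [^[:alnum:] (any character other than [, :, a, l, n, u, m) followed by the set ['-], which is why it consumed the "i-" in your string.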
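You can check what Python actually matches with the pattern from the question. This is just a diagnostic sketch; the warnings filter is there because recent Python versions may warn about the "[[" inside a set:

```python
import re
import warnings

my_string = 'compactified on a calabi-yau threefold @ ,.'

with warnings.catch_warnings():
    # Silence the "possible nested set" warning raised for "[[" in a set.
    warnings.simplefilter("ignore")
    pattern = re.compile("[^[:alnum:]['-]")

# Python parses this as two sets in a row:
#   [^[:alnum:]  any character other than [ : a l n u m
#   ['-]         an apostrophe or a dash
matches = pattern.findall(my_string)
print(matches)  # ['i-']
```

The only match is the two-character sequence "i-", which explains the 'calab yau' in the question's output.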

You can use the following solution:

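import re
# Group 1 captures an intra-word - or '; the second alternative matches any other non-alphanumeric character
p = re.compile(r"(\b[-']\b)|[\W_]")
test_str = "No -  d'Ante compactified on a calabi-yau threefold @ ,."
# Keep the captured intra-word character; replace everything else with a space
result = p.sub(lambda m: (m.group(1) if m.group(1) else " "), test_str)
print(result)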

The (\b[-']\b)|[\W_] regex matches and captures intra-word - and ' characters in group 1. In the re.sub callback we check whether that group matched and, if so, re-insert it with m.group(1); everything else (all non-word characters and underscores) is replaced with a space.

If you want to replace each sequence of non-word characters with a single space, use
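To see which matches fill the capture group, you can list them with finditer. The sample string below is made up for illustration:

```python
import re

p = re.compile(r"(\b[-']\b)|[\W_]")
sample = "d'Ante calabi-yau - @"

# Pair each matched text with group 1 (None when the second alternative matched).
pairs = [(m.group(0), m.group(1)) for m in p.finditer(sample)]

for matched, kept in pairs:
    print(repr(matched), "->", "kept" if kept else "replaced with a space")
```

Only the apostrophe in d'Ante and the dash in calabi-yau fill group 1. The free-standing dash has no word character on either side, so it falls through to [\W_] and gets replaced.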

p = re.compile(r"(\b[-']\b)|[\W_]+") 
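Putting it together, a minimal sketch of a helper built on this second pattern could look like the following. The function name strip_punctuation and the final strip() call are my additions, not part of the pattern itself; strip() is only there to drop the single space left where trailing punctuation used to be:

```python
import re

# Same pattern as above, with + so each run of separators becomes one space.
_PATTERN = re.compile(r"(\b[-']\b)|[\W_]+")


def strip_punctuation(text):
    """Replace punctuation with spaces, keeping intra-word - and '."""
    # group(1) is None unless an intra-word - or ' was captured.
    cleaned = _PATTERN.sub(lambda m: m.group(1) or " ", text)
    # Assumption: leading/trailing spaces are unwanted, so trim them.
    return cleaned.strip()


print(strip_punctuation('compactified on a calabi-yau threefold @ ,.'))
# compactified on a calabi-yau threefold
```

With the string from the question this should give exactly the expected result, without the trailing space.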
Wiktor Stribiżew answered Nov 18 '22 20:11