There is already a close answer for R, gsub("[^[:alnum:]['-]", " ", my_string), but it does not work in Python:
my_string = 'compactified on a calabi-yau threefold @ ,.'
re.sub("[^[:alnum:]['-]", " ", my_string)
gives 'compactified on a calab yau threefold @ ,.'
So not only does it remove the intra-word dash, it also removes the last letter of the word preceding the dash, and it does not remove the punctuation.
Expected result (string without any punctuation except the intra-word dash): 'compactified on a calabi-yau threefold'
R uses either the TRE (POSIX) or the PCRE regex engine, depending on the perl option (or on the function used). Python's re library implements its own Perl-like flavor with a smaller feature set. In particular, it does not support POSIX character classes such as [:alnum:], which matches alphabetic characters (letters) and numeric characters (digits).
In Python, [[:alnum:]] can be replaced with [^\W_] (or, ASCII only, [a-zA-Z0-9]), and the negated [^[:alnum:]] with [\W_] (or, ASCII only, [^a-zA-Z0-9]).
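A quick way to check these replacements is to run them over a sample string. This is only a minimal sketch: the sample string and variable names are made up for illustration, and it assumes Python 3 str patterns, where \W is Unicode-aware by default.

```python
import re

sample = "calabi_yau é 42!"  # made-up sample string (assumption, not from the question)

# Unicode-aware stand-in for [[:alnum:]]+ : word characters minus the underscore
unicode_runs = re.findall(r"[^\W_]+", sample)

# ASCII-only stand-in: the accented letter is no longer matched
ascii_runs = re.findall(r"[a-zA-Z0-9]+", sample)

# Negated class: every single character that is not a letter or digit
separators = re.findall(r"[\W_]", sample)

print(unicode_runs)
print(ascii_runs)
print(separators)
```

Note that the underscore counts as a word character for \w, which is why it has to be excluded (or included) explicitly.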
In a POSIX engine, [^[:alnum:]['-] matches any single character other than an alphanumeric (letter or digit), [, ', or -. That means it keeps every [, ', and - wherever they occur, not only inside words, so the R answer you refer to does not provide a correct answer.
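The output shown in the question follows from how Python's re reads the R pattern. Without POSIX class support, the first ] closes the set, so the pattern appears to be parsed as two consecutive sets: [^[:alnum:] (any character other than [, :, a, l, n, u, m) followed by ['-]. The small check below illustrates this; the variable names are mine, and the warnings filter is there only because recent Python versions emit a FutureWarning about a possible nested set.

```python
import re
import warnings

my_string = 'compactified on a calabi-yau threefold @ ,.'
r_pattern = "[^[:alnum:]['-]"  # the pattern copied from the R answer

with warnings.catch_warnings():
    # Silence the "possible nested set" FutureWarning for this demonstration only
    warnings.simplefilter("ignore", FutureWarning)
    matches = re.findall(r_pattern, my_string)
    replaced = re.sub(r_pattern, " ", my_string)

print(matches)   # the only match is the two-character sequence 'i-'
print(replaced)
```

The single match consumes both the i and the dash, which is why the letter before the dash disappears while the punctuation at the end is left alone.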
You can use the following solution:
import re
p = re.compile(r"(\b[-']\b)|[\W_]")
test_str = "No - d'Ante compactified on a calabi-yau threefold @ ,."
result = p.sub(lambda m: (m.group(1) if m.group(1) else " "), test_str)
print(result)
The (\b[-']\b)|[\W_] regex matches and captures an intra-word - or ' (one with a word boundary on each side) into Group 1. The replacement function passed to sub checks whether the capture group matched and, if it did, re-inserts it with m.group(1); everything else (all non-word characters and underscores) is replaced with a space.
If you want to replace each sequence of non-word characters with a single space, use
p = re.compile(r"(\b[-']\b)|[\W_]+")
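Putting the pieces together, a helper along these lines should give the expected result from the question. The name strip_punctuation and the final strip() call are my own additions for illustration, not part of the solution above.

```python
import re

# Group 1 captures an intra-word - or ' (word boundary on both sides);
# the other branch matches a run of non-word characters or underscores.
PATTERN = re.compile(r"(\b[-']\b)|[\W_]+")

def strip_punctuation(text):
    """Hypothetical helper: keep intra-word - and ', collapse everything else to one space."""
    cleaned = PATTERN.sub(lambda m: m.group(1) or " ", text)
    # Trim the space left behind by leading or trailing punctuation
    return cleaned.strip()

first = strip_punctuation('compactified on a calabi-yau threefold @ ,.')
second = strip_punctuation("No - d'Ante compactified on a calabi-yau threefold @ ,.")
print(first)
print(second)
```

The standalone dash in "No - d'Ante" is dropped because it has no word character on either side, while the apostrophe in d'Ante and the dash in calabi-yau are kept.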