Is there a way to replace all types of hyphens with the simple ASCII '-'? I am looking for something like this, which works for whitespace:
txt = re.sub(r'[\s]+',' ',txt)
I believe that some non-ASCII hyphens are preventing the correct removal of some specific stopwords (names of projects that are joined by hyphens).
For instance, I want to replace AR–L1003 with AR-L1003, but I want to do this for the entire text.
You can just list those hyphens in a character class. Here is one possible list -- extend it to your needs:
txt = re.sub(r'[‐᠆﹣－⁃−]+', '-', txt)
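Look-alike dash characters are hard to tell apart in source code, and a plain ASCII '-' placed between two other characters inside a class is read as a range. One way to avoid both problems is to spell the class out with \u escapes. The sketch below is only an illustration: the names HYPHEN_CHARS, HYPHEN_RUN and to_ascii_hyphen are made up here, and the en dash, em dash, non-breaking hyphen and figure dash are additions to the list above (the en dash appears to be the character in the question's AR–L1003 example).

```python
import re

# Code points to treat as hyphens; \u escapes keep look-alikes distinguishable.
# The selection is an assumption for illustration: extend or trim as needed.
HYPHEN_CHARS = (
    "\u2010"  # HYPHEN
    "\u2011"  # NON-BREAKING HYPHEN
    "\u2012"  # FIGURE DASH
    "\u2013"  # EN DASH
    "\u2014"  # EM DASH
    "\u1806"  # MONGOLIAN TODO SOFT HYPHEN
    "\ufe63"  # SMALL HYPHEN-MINUS
    "\uff0d"  # FULLWIDTH HYPHEN-MINUS
    "\u2043"  # HYPHEN BULLET
    "\u2212"  # MINUS SIGN
)

# One or more of the listed characters in a row.
HYPHEN_RUN = re.compile("[" + HYPHEN_CHARS + "]+")


def to_ascii_hyphen(text):
    """Replace each run of listed hyphen-like characters with one ASCII '-'."""
    return HYPHEN_RUN.sub("-", text)


print(to_ascii_hyphen("AR\u2013L1003"))  # AR-L1003
```

Because of the trailing +, a run of several such characters collapses to a single '-'. Drop the + if each character should map to its own '-'.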
The standard re library does not support the \p syntax for matching Unicode categories, but if you can install and import the third-party regex module, then it is possible:
import regex
txt = regex.sub(r'\p{Pd}+', '-', txt)
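If installing regex is not an option, roughly the same effect can be had with the standard library alone by asking unicodedata which code points belong to category Pd and building a character class from them. This is a sketch under that assumption, not an exact equivalent of the regex module (the set depends on the Unicode version bundled with your Python), and the names DASHES, DASH_RUN and normalize_dashes are made up for the example. Note that Pd does not cover everything in the hand-written list: the minus sign (U+2212) is in category Sm and the hyphen bullet (U+2043) is in Po, so add those explicitly if you need them.

```python
import re
import sys
import unicodedata

# Collect every code point whose general category is Pd (Punctuation, dash).
# This scans the whole code space once, at import time.
DASHES = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Pd"
)

# re.escape keeps the ASCII '-' literal inside the character class.
DASH_RUN = re.compile("[" + re.escape(DASHES) + "]+")


def normalize_dashes(text):
    """Replace each run of Pd characters with a single ASCII '-'."""
    return DASH_RUN.sub("-", text)


print(normalize_dashes("AR\u2013L1003"))  # AR-L1003
```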