Replace all hyphen types by the ascii hyphen "-"

Question

Is there a way to replace all types of hyphens by the simple ASCII '-'? I am looking for something like this that works for spaces:

txt = re.sub(r'[\s]+',' ',txt)

I believe that some non-ASCII '-' hyphens are avoiding the correct process of removing some specific stopwords (name of projects that are connected by hyphens).

I want to replace this AR–L1003' for instance by AR-L1003, but I want to do this for the entire text.

trincot · Accepted Answer

You can just list those hyphens in a class. Here is one possible list -- extend it to your needs:

txt = re.sub(r'[‐᠆﹣－⁃−]+','-',txt)

The standard re library does not support the \p syntax for matching unicode categories, but if you can import regex, then it is possible:

import regex

txt = regex.sub(r'\p{Pd}+', '-', txt)

Replace all hyphen types by the ascii hyphen "-"

Tags:

DanielTheRocketMan

1 Answers

trincot

Recent Activity

Donate For Us

Replace all hyphen types by the ascii hyphen "-"

Tags:

DanielTheRocketMan

1 Answers

trincot

Related questions

Recent Activity

Donate For Us