
Split String into alpha & punctuation with exceptions regex

Tags:

python

regex

I am trying to split a string into two kinds of parts: alphanumeric characters and special characters. I also want to limit the occurrence of the escape character.

b.sc.... = ['b.sc.','...'] (Preserve "." inside word & outside word just once)

really???? = ['really','????'] (split when any other special char encountered)

I went through a lot of SO questions before posting here. This is what I have come up with so far: re.findall(r"[\w+|\-.+\w]+|\W+", text). How should I proceed from here?
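For reference, here is a small sketch of what the attempted pattern does on the two sample inputs. The name `attempt` is made up for illustration. Inside the square brackets, `+`, `|` and `.` are literal characters rather than operators, so the class simply accepts word characters together with `+`, `|`, `-` and `.`, and the trailing dots are never separated from the word.

```python
import re

# The pattern from the question, with the closing quote restored.
# Inside [...] the characters + | . are literals, not operators.
ATTEMPT_RE = re.compile(r"[\w+|\-.+\w]+|\W+")

def attempt(text):
    """Return the tokens the question's pattern produces (hypothetical helper)."""
    return ATTEMPT_RE.findall(text)

print(attempt("b.sc...."))    # the dots stay glued to the word
print(attempt("really????"))  # '?' is not in the class, so it is split off
```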

xerxes01 asked Jun 28 '26 16:06



1 Answer

You can use

[re.sub(r'([.-])+', r'\1', x) for x in re.findall(r'\w+(?:-+\w+)+|\w+(?:\.+\w+)*\.?|[^\w\s]+', text)]
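The one-liner above assumes that `re` is imported and that `text` holds the input string. A minimal runnable sketch follows; the names `TOKEN_RE`, `REPEAT_RE` and `split_tokens` are invented here for illustration and are not part of the answer.

```python
import re

# Same two patterns as in the one-liner, precompiled.
TOKEN_RE = re.compile(r'\w+(?:-+\w+)+|\w+(?:\.+\w+)*\.?|[^\w\s]+')
REPEAT_RE = re.compile(r'([.-])+')

def split_tokens(text):
    """Tokenize text, then collapse runs of '.' or '-' in every token."""
    return [REPEAT_RE.sub(r'\1', tok) for tok in TOKEN_RE.findall(text)]

for sample in ("really????", "b..sc", "well--known", "b.sc...."):
    print(sample, "->", split_tokens(sample))
```

Note that the collapsing step is applied to every token, so for `b.sc....` the punctuation token `...` found by `re.findall` is itself reduced to a single `.`.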

See this regex demo

Details

  • \w+(?:-+\w+)+ - one or more word chars followed by one or more repetitions of: one or more hyphens and then one or more word chars
  • | - or
  • \w+(?:\.+\w+)*\.? - one or more word chars followed by zero or more repetitions of: one or more dots and then one or more word chars; then an optional trailing dot
  • | - or
  • [^\w\s]+ - one or more non-word and non-whitespace chars.
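To see which alternative is responsible for which kind of token, each branch can be tried on its own with `re.fullmatch`. This is only an illustrative sketch: the labels `hyphenated`, `dotted` and `punct` and the helper `matching_branches` are made-up names.

```python
import re

# The three alternatives of the pattern, kept separate for inspection.
BRANCHES = {
    'hyphenated': r'\w+(?:-+\w+)+',
    'dotted': r'\w+(?:\.+\w+)*\.?',
    'punct': r'[^\w\s]+',
}

def matching_branches(token):
    """Return the labels of the branches that match the whole token."""
    return [name for name, pat in BRANCHES.items() if re.fullmatch(pat, token)]

for tok in ("well-known", "b.sc.", "really", "????", "well-"):
    print(repr(tok), matching_branches(tok))
```

A plain word such as `really` is matched by the second branch, because its dotted group may repeat zero times. A token like `well-` matches no single branch, since the hyphen branch requires word characters after the hyphen.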

The re.sub(r'([.-])+', r'\1', x) part is a post-processing step to replace one or more consecutive . or - chars with a single occurrence.
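The post-processing step can be looked at in isolation. In the sketch below, `collapse` wraps the same `re.sub` call. `collapse_words_only` is a hypothetical variant, not part of the answer, which leaves pure punctuation tokens untouched; it assumes that a run such as `...` should survive intact, as in the question's `b.sc....` example.

```python
import re

def collapse(token):
    """Replace each run of '.' or '-' characters with a single character."""
    return re.sub(r'([.-])+', r'\1', token)

def collapse_words_only(token):
    """Hypothetical variant: collapse only tokens that start with a word char."""
    return collapse(token) if re.match(r'\w', token) else token

print(collapse("b..sc"))           # dots inside a word are reduced to one
print(collapse("well---known"))    # the same happens for hyphens
print(collapse("..."))             # a pure dot run is reduced as well
print(collapse_words_only("..."))  # the variant keeps it unchanged
```

Whether punctuation-only tokens should be collapsed depends on the intended output, so the variant is only one possible reading of the requirement.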

Wiktor Stribiżew answered Jun 30 '26 05:06
