
Split string by position not character

We know that anchors, word boundaries, and lookaround match at a position, rather than matching a character.
Is it possible to split a string at such a position with a regex (specifically in Python)?

For example consider the following string:

"ThisisAtestForchEck,Match IngwithPosition." 

I want the following result (sub-strings that start with an uppercase letter that is not preceded by a space):

['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']

If I split with a capturing group I get:

>>> re.split(r'([A-Z])',s)
['', 'T', 'hisis', 'A', 'test', 'F', 'orch', 'E', 'ck,', 'M', 'atch ', 'I', 'ngwith', 'P', 'osition.']

And this is the result with look-around:

>>> re.split(r'(?<=[A-Z])',s)
['ThisisAtestForchEck,Match IngwithPosition.']
>>> re.split(r'((?<=[A-Z]))',s)
['ThisisAtestForchEck,Match IngwithPosition.']
>>> re.split(r'((?<=[A-Z])?)',s)
['ThisisAtestForchEck,Match IngwithPosition.']

Note that if I want to split at every uppercase letter, including those preceded by a space, e.g.:

['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']

I can use re.findall, viz.:

>>> re.findall(r'([A-Z][^A-Z]*)',s)
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']

But what about the first example: is it possible to solve it with re.findall?

asked May 01 '26 01:05 by Mazdak


2 Answers

A way with re.findall:

re.findall(r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*',s)

When you switch your approach from split to findall, the first job is to reformulate your requirements: "I want to split the string at each uppercase letter not preceded by a space" => "I want to find runs of one or more space-separated substrings that each begin with an uppercase letter, except at the start of the string (where the first substring may begin with a non-uppercase character)".
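As a quick check, the pattern above can be run against the string from the question. This is only a minimal sketch: the names `pattern` and `parts` are illustrative and not part of the original answer.

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# Pattern from the answer above:
#   (?:[A-Z]|^[^A-Z\s])      an uppercase letter, or any non-uppercase,
#                            non-space character at the start of the string
#   [^A-Z\s]*                the rest of that "word"
#   (?:\s+[A-Z][^A-Z]*)*     any following words that are introduced by
#                            whitespace plus an uppercase letter
pattern = r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*'

parts = re.findall(pattern, s)
print(parts)
# ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
```

Because the pattern has no capturing groups, re.findall returns the full text of each match.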

answered May 02 '26 13:05 by Casimir et Hippolyte


(?<!\s)(?=[A-Z])

You can split on this pattern with the third-party regex module. The standard re module could not split on zero-width assertions before Python 3.7.

import regex  # third-party module (pip install regex)
x = "ThisisAtestForchEck,Match IngwithPosition."
print(regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1))

or, to drop any empty strings from the result:

print([i for i in regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1) if i])

See demo.

https://regex101.com/r/sJ9gM7/65
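For readers on Python 3.7 or later, the same pattern should also work with the standard re module, since re.split gained support for patterns that match the empty string in that release. The following is a sketch under that version assumption; the name `pieces` is illustrative.

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# Split at every position that is followed by an uppercase letter
# and is not preceded by whitespace. Assumes Python 3.7+, where
# re.split accepts patterns that match the empty string.
# The pattern also matches at position 0, which yields a leading
# empty string, so empty pieces are filtered out.
pieces = [p for p in re.split(r'(?<!\s)(?=[A-Z])', s) if p]
print(pieces)
# ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
```

Filtering empty strings mirrors the list comprehension used with the regex module above.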

answered May 02 '26 14:05 by vks


