I have the following problem. I want to find all the words in a string that typically looks like this:
HelloWorldToYou
Notice that each word starts with a capital letter and is immediately followed by the next word, with no separator.
I want to create a list of words from it, so the expected output is a list that looks like this:
['Hello','World','To','You']
In Python, I used the following:
import re
mystr = 'HelloWorldToYou'
pat = re.compile(r'([A-Z](.*?))(?=[A-Z]+)')
[x[0] for x in pat.findall(mystr)]
['Hello', 'World', 'To']
However, I'm unable to capture the last word, 'You'. Is there a way to capture it as well? Thanks in advance.
Add an alternation with $ to the lookahead, or simply match the lowercase letters explicitly:
import re
mystr = 'HelloWorldToYou'
pat = re.compile(r'([A-Z][a-z]*)')
# or your version with `.*?`: pat = re.compile(r'([A-Z].*?)(?=[A-Z]+|$)')
print(pat.findall(mystr))
Output:
['Hello', 'World', 'To', 'You']
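To see why the last word goes missing, here is a minimal sketch that runs the question's pattern (simplified to a single capturing group, which is my assumption about what was intended) next to the version with `|$` added. The variable names are only illustrative.

```python
import re

mystr = 'HelloWorldToYou'

# Pattern from the question (one group instead of two): the lookahead
# requires an uppercase letter after every match, so the final word,
# which is followed by nothing, can never satisfy it.
original = re.compile(r'([A-Z].*?)(?=[A-Z]+)')

# Same pattern with `|$` added, so the end of the string also ends a word.
fixed = re.compile(r'([A-Z].*?)(?=[A-Z]+|$)')

original_words = original.findall(mystr)
fixed_words = fixed.findall(mystr)

print(original_words)  # ['Hello', 'World', 'To']
print(fixed_words)     # ['Hello', 'World', 'To', 'You']
```

The only difference between the two patterns is the `|$` branch inside the lookahead.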
Regex explanation:
([A-Z][a-z]*) - A capturing group that matches [A-Z], an uppercase English letter, followed by [a-z]*, zero or more lowercase English letters.
.*? - Matches any characters other than a newline, lazily (as few as possible).
The lookahead can be omitted if we use [a-z]*, but if you use .*?, then you need it:
(?=[A-Z]+|$) - Match up to an uppercase English letter (the + can actually be removed here), OR up to the end of the string ($).
If you do not use the lookahead version, you can even remove the capturing group for better performance and use finditer:
import re
mystr = 'HelloWorldToYou'
pat = re.compile(r'[A-Z][a-z]*')
print([x.group() for x in pat.finditer(mystr)])
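If this is needed in more than one place, the finditer version could be wrapped in a small helper. This is just a sketch: `split_capitalized_words` and `WORD_PATTERN` are hypothetical names, not part of the re module.

```python
import re

# Hypothetical names; the pattern is compiled once and reused across calls.
WORD_PATTERN = re.compile(r'[A-Z][a-z]*')

def split_capitalized_words(text):
    """Return each run of one uppercase letter plus the lowercase letters after it."""
    return [m.group() for m in WORD_PATTERN.finditer(text)]

words = split_capitalized_words('HelloWorldToYou')
print(words)  # ['Hello', 'World', 'To', 'You']

# With no capturing group in the pattern, findall returns the same whole matches.
same_words = WORD_PATTERN.findall('HelloWorldToYou')
```

Note that this pattern only keeps ASCII letters, so anything else (digits, for example) is silently skipped: 'Hello2World' gives ['Hello', 'World'].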