 

Python re can't split zero-width anchors? [duplicate]

Tags: python, regex

import re

s = 'PythonCookbookListOfContents'

# The first call does not split the string (observed on Python 2;
# Python 3.7+ behaves differently)
print(re.split(r'(?<=[a-z])(?=[A-Z])', s))

# The second call works: it inserts a space at each boundary
print(re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', s))

# The split should return ['Python', 'Cookbook', 'List', 'Of', 'Contents']

How can I split a string at the boundary between a lowercase character and an uppercase character using Python's re module?

Why does the first line fail to work while the second line works well?

asked Dec 16 '15 16:12 by Booster


1 Answer

According to the re.split documentation (this applies to Python versions before 3.7):

Note that split will never split a string on an empty pattern match. For example:

>>> re.split('x*', 'foo')
['foo']
>>> re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']
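
On Python 3.7 and later, re.split does split on empty matches, so the pattern from the question works directly. A minimal sketch, assuming such an interpreter (on earlier versions this call either returns the string unsplit or is rejected):

```python
import re

s = 'PythonCookbookListOfContents'

# Assumes Python 3.7 or later, where re.split honors zero-width matches.
# The lookbehind/lookahead pair matches the empty position between a
# lowercase letter and the uppercase letter that follows it.
parts = re.split(r'(?<=[a-z])(?=[A-Z])', s)
print(parts)  # ['Python', 'Cookbook', 'List', 'Of', 'Contents'] on 3.7+
```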

How about using re.findall instead? (Instead of focusing on separators, focus on the item you want to get.)

>>> import re
>>> s = 'PythonCookbookListOfContents'
>>> re.findall('[A-Z][a-z]+', s)
['Python', 'Cookbook', 'List', 'Of', 'Contents']
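
One caveat with the findall pattern above: it requires every word to start with a capital, so a leading lowercase word (as in 'pythonCookbookList') would be dropped. The sketch below uses a slightly broader pattern as one possible adjustment; the name WORD and the camelCase input are illustrative assumptions, not part of the original answer, and runs of capitals or digits are still not handled.

```python
import re

# Hypothetical variant: the optional leading capital lets a
# lowercase-first word match as well.
WORD = re.compile(r'[A-Z]?[a-z]+')

camel = WORD.findall('pythonCookbookList')
pascal = WORD.findall('PythonCookbookListOfContents')

print(camel)   # ['python', 'Cookbook', 'List']
print(pascal)  # ['Python', 'Cookbook', 'List', 'Of', 'Contents']
```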

UPDATE

Using the regex module (an alternative regular expression module intended to replace re), you can split on zero-width matches:

>>> import regex
>>> s = 'PythonCookbookListOfContents'
>>> regex.split('(?<=[a-z])(?=[A-Z])', s, flags=regex.VERSION1)
['Python', 'Cookbook', 'List', 'Of', 'Contents']

NOTE: Specify the regex.VERSION1 flag to enable the split-on-zero-length-match behavior.
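
If installing the third-party regex module is not an option, a standard-library workaround builds on the question's own observation that re.sub does act on zero-width matches: insert a separator at each boundary, then split on it. This is only a sketch; split_camel_case is a hypothetical helper name, and it assumes the input contains no whitespace of its own.

```python
import re


def split_camel_case(text):
    """Hypothetical helper: split at each lowercase-to-uppercase boundary."""
    # re.sub replaces each zero-width boundary with a space...
    spaced = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', text)
    # ...and str.split then breaks the string on that whitespace.
    # Assumption: the original text has no whitespace to confuse the split.
    return spaced.split()


words = split_camel_case('PythonCookbookListOfContents')
print(words)  # ['Python', 'Cookbook', 'List', 'Of', 'Contents']
```

Because it relies only on re.sub and str.split, this approach does not depend on how re.split treats empty matches.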

answered Nov 15 '22 16:11 by falsetru