Python unicode regex not working on big strings [duplicate]

Question

When I use re.sub on a fairly large unicode string, the function finds and replaces only the first part of the matches and ignores the rest. However, when I shorten the string (for example, by removing its first half), it works correctly. When I run the same test on an ASCII string, it also works correctly.

Can anyone help me with figuring out the problem?

Code:

s = u"Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка."

# Find all capital letters, and add '!' before them
print re.sub(ur"([\u0410-\u042f])", ur"!\1", s, re.UNICODE)

Result:

!Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка.

As you see, the last part of the string wasn't processed.

Update:

As RomanPerekhrest discovered below, when we add more flags (such as re.MULTILINE or re.VERBOSE), the function replaces a somewhat larger part of the string, but still not all of it.

dawg · Accepted Answer

The function signature for re.sub has a count parameter before flags:

re.sub(pattern, repl, string, count=0, flags=0)

The behavior you are seeing occurs because flags must be passed as a keyword argument. When a flag is passed as the fourth positional argument to re.sub, it is interpreted as count. (Thanks JF Sebastian. Go vote up his answer.)

Easy to demonstrate:

>>> re.sub(r'\d', '1', '0'*50, re.UNICODE)
'11111111111111111111111111111111000000000000000000'
>>> re.sub(r'\d', '1', '0'*50, re.M)
'11111111000000000000000000000000000000000000000000'

re.UNICODE has a value of 32 and re.M has a value of 8, so only 32 and 8 replacements are made, respectively. Correcting the bug:

>>> re.sub(r'\d', '1', '0'*50, flags=re.M)
'11111111111111111111111111111111111111111111111111'
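
The same mechanism accounts for the output in the question and for the update about combined flags. Below is a minimal sketch in Python 3 syntax (the thread itself uses Python 2); the shortened string s and the variable names are illustrative, not taken from the question. It shows that flag constants are integer-valued, so a flag passed positionally becomes the replacement limit, and that OR-ing flags together only raises that limit.

```python
import re

# Flag constants are integer-valued, so a flag passed as the fourth
# positional argument is read as count (the maximum number of replacements).
unicode_count = int(re.UNICODE)                  # 32
combined_count = int(re.UNICODE | re.MULTILINE)  # 32 | 8 == 40

# Illustrative input: 14 repetitions, 3 capital letters each, 42 in total.
s = "Очень Длинная Строка. " * 14

# Bug: re.UNICODE lands in the count slot, so at most 32 replacements happen.
buggy = re.sub(r"([\u0410-\u042f])", r"!\1", s, re.UNICODE)

# Fix: pass flags by keyword so count keeps its default of 0 (no limit).
fixed = re.sub(r"([\u0410-\u042f])", r"!\1", s, flags=re.UNICODE)

print(buggy.count("!"), fixed.count("!"))
```

With this input the buggy call stops after 32 replacements, which matches the point where the output in the question stops changing. Recent Python 3 releases may also emit a DeprecationWarning when count or flags is passed positionally, which makes this mistake easier to spot.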

Side note:

The PyPI regex module supports more robust Unicode logic.

For example, to modify any uppercase letter, you can use the Unicode property \p{Lu}:

>>> s=u"Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка. Очень Длинная Строка."
>>> import regex
>>> print regex.sub(ur"(\p{Lu})", ur"!\1", s)
!Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка. !Очень !Длинная !Строка.
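
If installing the regex module is not an option, a similar effect can be approximated with the standard library alone. This is a sketch of an alternative technique, not what the answer above uses: it matches every word character and checks its Unicode general category with unicodedata. The function name mark_uppercase is hypothetical, and the code assumes Python 3, where str patterns are Unicode-aware by default.

```python
import re
import unicodedata


def mark_uppercase(text):
    """Prefix every character whose general category is Lu with '!'."""
    def repl(match):
        ch = match.group(0)
        # Only characters in category Lu (Letter, uppercase) get the prefix.
        return "!" + ch if unicodedata.category(ch) == "Lu" else ch

    # \w matches Unicode word characters for str patterns in Python 3.
    return re.sub(r"\w", repl, text)


print(mark_uppercase("Очень Длинная Строка. Hello World."))
# → !Очень !Длинная !Строка. !Hello !World.
```

Calling a function for every word character is slower than a single \p{Lu} pattern, so this is mainly useful when adding a dependency is undesirable.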

Tags:

python

string

regex

unicode

Vladimir Fomenko
