
Python regex pattern max length in re.compile?

I am trying to compile a big pattern with re.compile in Python 3.

The pattern is composed of 500 small words (I want to remove them from a text). The problem is that the pattern stops after about 18 words.

Python doesn't raise any error.

What I do is:

stoplist = map(lambda s: "\\b" + s + "\\b", stoplist)
stopstring = '|'.join(stoplist)
stopword_pattern = re.compile(stopstring)

The stopstring is ok (all the words are in) but the pattern is much shorter. It even stops in the middle of a word!

Is there a max length for the regex pattern?

asked May 13 '15 17:05 by mquantin




1 Answer

Consider this example:

import re
stop_list = map(lambda s: "\\b" + str(s) + "\\b", range(1000, 2000))
stopstring = "|".join(stop_list)
stopword_pattern = re.compile(stopstring)

If you try to print the pattern, you'll see something like

>>> print(stopword_pattern)
re.compile('\\b1000\\b|\\b1001\\b|\\b1002\\b|\\b1003\\b|\\b1004\\b|\\b1005\\b|\\b1006\\b|\\b1007\\b|\\b1008\\b|\\b1009\\b|\\b1010\\b|\\b1011\\b|\\b1012\\b|\\b1013\\b|\\b1014\\b|\\b1015\\b|\\b1016\\b|\\b1017\\b|\)

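which seems to indicate that the pattern is incomplete. However, this is only a limitation of the __repr__ of compiled pattern objects: in CPython the displayed pattern is cut off at 200 characters, which is why the output above breaks off just after the 18th alternative. The compiled pattern itself is complete. If you try to perform a match against the "missing" part of the pattern, you'll see that it still succeeds: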

>>> stopword_pattern.match("1999")
<_sre.SRE_Match object; span=(0, 4), match='1999'>
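
A quick way to confirm this without relying on the printed form is to compare the compiled object's pattern attribute with the string that was passed in. The sketch below is only an illustration: the numeric stop list stands in for the asker's 500 words, and the use of re.escape and the sample sentence are assumptions, not part of the original question.

```python
import re

# Hypothetical stop list standing in for the asker's 500 words.
stoplist = [str(n) for n in range(1000, 2000)]

# re.escape guards against words that contain regex metacharacters.
stopstring = "|".join(r"\b" + re.escape(word) + r"\b" for word in stoplist)
stopword_pattern = re.compile(stopstring)

# The compiled object keeps the full source string in .pattern.
full_pattern_kept = stopword_pattern.pattern == stopstring

# Only the printed representation is shortened.
repr_is_shorter = len(repr(stopword_pattern)) < len(stopstring)

# An alternative near the end of the pattern is still removed from a text.
cleaned = stopword_pattern.sub("", "keep 1999 this")
```

Here full_pattern_kept and repr_is_shorter should both be true, and cleaned no longer contains "1999" (the spaces on either side of the removed word are left in place).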
answered Oct 01 '22 19:10 by chepner