Applying a Regex to a Substring Without using String Slice

Tags: python, regex

I want to search for a regex match in a larger string from a certain position onwards, and without using string slices.

Background: I want to search through a string iteratively for matches of various regexes. A natural solution in Python would be to keep track of the current position within the string and use e.g.

re.match(regex, largeString[pos:])

in a loop. But for really large strings (~1 MB), slicing as in largeString[pos:] becomes expensive, because every slice copies the rest of the string. I'm looking for a way to get around that.

Side note: Funnily, a corner of the Python documentation mentions an optional pos parameter to the match function (which would be exactly what I want), but it is nowhere to be found in the descriptions of the functions themselves :-).

ThomasH asked Jun 09 '11 09:06



2 Answers

The variants with pos and endpos parameters only exist as members of regular expression objects. Try this:

import re

# pos is accepted by the methods of a compiled pattern object
pattern = re.compile("match here")
text = "don't match here, but do match here"
start = text.find(",")  # start searching at the comma
print(pattern.search(text, start).span())

... outputs (25, 35)
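
For the iterative use case in the question, the same method form can drive a scanning loop that advances an integer position instead of slicing. The following is only a minimal sketch: the token patterns, the names TOKEN_PATTERNS and scan, and the sample input are hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical patterns, compiled once so that the method form with pos is available.
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"\d+")),
    ("NAME", re.compile(r"[A-Za-z_]\w*")),
    ("SPACE", re.compile(r"\s+")),
]

def scan(text):
    """Yield (kind, matched_text) pairs, advancing pos instead of slicing text."""
    pos = 0
    while pos < len(text):
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)  # match must begin exactly at pos
            if m and m.end() > pos:       # ignore empty matches to guarantee progress
                yield kind, m.group()
                pos = m.end()
                break
        else:
            # No pattern matched at this position.
            raise ValueError(f"no pattern matches at position {pos}")

tokens = list(scan("abc 123 x9"))
print(tokens)
```

Because text is never sliced, the only new strings created are the matched pieces themselves.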

Martin Stone answered Oct 11 '22 03:10


The pos keyword is only available in the method versions. For example,

re.match("e+", "eee3", pos=1)

is invalid (it raises a TypeError, because the module-level function has no pos argument), but

pattern = re.compile("e+")
pattern.match("eee3", pos=1)

works.
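
One caveat worth keeping in mind is that passing pos is not fully equivalent to slicing: reported offsets stay relative to the original string, and '^' still refers to the real start of the string rather than to pos. A small sketch of these differences, using a made-up sample string and variable names:

```python
import re

text = "xeee3"  # sample input, chosen only for illustration

plain = re.compile("e+")
m = plain.match(text, pos=1)
print(m.span())  # offsets refer to text itself, not to a slice of it

anchored = re.compile("^e+")
with_pos = anchored.match(text, 1)     # '^' still means the real start of text
with_slice = anchored.match(text[1:])  # the slice begins at 'e', so '^' matches there
print(with_pos, with_slice)

bounded = plain.match(text, 1, 3)      # endpos=3: only text[1:3] is considered
print(bounded.group())
```

So a pattern that relies on '^' behaves differently once slicing is replaced by pos; with the match method the anchor is unnecessary anyway, since match only succeeds starting exactly at pos.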

Jeremy answered Oct 11 '22 03:10