Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Alternatives to variable-width lookbehind in Python regex

I've recently decided to jump into the deep end of the Python pool and start converting some of my R code over to Python and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce some spaces in between letters when I OCR the file, I'd get to the value I want with something like this:

oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")

In Python, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:

oAcctNum = re.sub(r"(?<=\b\w)\s(?=\w\b)", "")
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9])", textBlock[indexVal]).group(1)

Is there a more efficient way to do this? Because while this example was trivial, this issue comes up in very complex ways with the data I work with and I'd hate to have to do this kind of preprocessing for every line of text I analyze.

Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.

like image 231
tblznbits Avatar asked Sep 27 '22 06:09

tblznbits


People also ask

What is Lookbehind in regex?

Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there.

Does SED support Lookbehind?

I created a test using grep but it does not work in sed . This works correctly by returning bar . I was expecting footest as output, but it did not work. sed does not support lookaround assertions.

What is negative lookahead regex?

In this type of lookahead the regex engine searches for a particular element which may be a character or characters or a group after the item matched. If that particular element is not present then the regex declares the match as a match otherwise it simply rejects that match.

What is negative lookbehind?

A negative lookbehind assertion asserts true if the pattern inside the lookbehind is not matched.


1 Answers

Notice that if you can use groups, you generally do not need lookbehinds. So how about

match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
if match:
    text = match.group(1)

In practice:

>>> string = 'ORIG : / AB123'
>>> match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
>>> match
<_sre.SRE_Match object; span=(0, 12), match='ORIG : / AB123'>
>>> match.group(1)
'AB123'
like image 142