Alternatives to variable-width lookbehind in Python regex

Tags: python, regex, lookaround

I've recently decided to jump into the deep end of the Python pool and start converting some of my R code to Python, and I'm stuck on something that is very important to me. In my line of work I spend a lot of time parsing text data, which is usually very unstructured. As a result, I've come to rely on regex lookarounds, and R's lookaround support is quite robust. For example, if I'm parsing a PDF and the OCR step introduces stray spaces between characters, I'd get to the value I want with something like this:

oAcctNum <- str_extract(textBlock[indexVal], "(?<=ORIG\\s?:\\s?/\\s?)[A-Z0-9]+")

In Python, this isn't possible because the ? quantifiers make the lookbehind a variable-width expression, and Python's re module only accepts fixed-width lookbehinds. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonic way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:

# Strip the stray spaces first, then extract the value from the cleaned string
cleaned = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", textBlock[indexVal])
oAcctNum = re.search(r"(?<=ORIG:/)([A-Z0-9]+)", cleaned).group(1)

Is there a more efficient way to do this? While this example is trivial, the issue comes up in much more complex forms in the data I work with, and I'd hate to have to do this kind of preprocessing for every line of text I analyze.

Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.

231

asked Sep 27 '22 06:09

tblznbits

1 Answer

Note that if you can use capture groups, you generally do not need lookbehinds: match the prefix as part of the pattern and capture only the part you want. So how about:

import re

match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
if match:
    text = match.group(1)
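
To make the contrast concrete, here is a minimal sketch (not part of the original answer) that tries to compile the lookbehind pattern from the question and then runs the group-based pattern on a sample string. The variable names and the sample input are my own assumptions, chosen only for illustration.

```python
import re

# Pattern from the question: the optional \s? pieces make the lookbehind variable-width.
lookbehind_pattern = r"(?<=ORIG\s?:\s?/\s?)[A-Z0-9]+"
# Pattern from this answer: match the prefix normally and capture only the value.
group_pattern = r"ORIG\s?:\s?/\s?([A-Z0-9]+)"

# The standard re module rejects the lookbehind at compile time.
try:
    re.compile(lookbehind_pattern)
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

# The group version compiles fine; guard against a missing match before calling group().
match = re.search(group_pattern, "ORIG : / AB123")  # sample input, assumed
value = match.group(1) if match else None

print(lookbehind_error)
print(value)
```

The only real difference is where the prefix lives: inside the match but outside the group, instead of inside a lookbehind.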

In practice:

>>> import re
>>> string = 'ORIG : / AB123'
>>> match = re.search(r"ORIG\s?:\s?/\s?([A-Z0-9]+)", string)
>>> match
<re.Match object; span=(0, 14), match='ORIG : / AB123'>
>>> match.group(1)
'AB123'
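
If the same idea has to cover many fields, one possible way to avoid per-line preprocessing is to precompile one group-based pattern per field and loop over them. This is only a sketch under my own assumptions: the "BENEF" label, the FIELD_PATTERNS table and the extract_fields helper are hypothetical and do not come from the question.

```python
import re

# Hypothetical field table for illustration; only the "orig" pattern comes from the answer.
FIELD_PATTERNS = {
    "orig": re.compile(r"ORIG\s?:\s?/\s?([A-Z0-9]+)"),
    "benef": re.compile(r"BENEF\s?:\s?/\s?([A-Z0-9]+)"),  # assumed label
}


def extract_fields(line):
    """Return {field name: captured value} for each pattern that matches the line."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(line)
        if match:
            # group(1) is the value after the (possibly space-padded) prefix
            found[name] = match.group(1)
    return found


print(extract_fields("ORIG : / AB123 BENEF:/ZZ9"))
print(extract_fields("no fields on this line"))
```

Fields that do not appear are simply left out of the result, so the caller never has to call group() on None.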

142

answered Oct 20 '22 08:10

Antti Haapala -- Слава Україні
