re.findall regex hangs or very slow

Tags:

My input file is a large txt file with concatenated texts I got from an open text library. I am now trying to extract only the content of the book itself and filter out other stuff such as disclaimers etc. So I have around 100 documents in my large text file (around 50 mb).

I then have identified the start and end markers of the contents themselves, and decided to use a Python regex to find me everything between the start and end marker. To sum it up, the regex should look for the start marker, then match everything after it, and stop looking once the end marker is reached, then repeat these steps until the end of the file is reached.

The following code works flawlessly when I feed a small, 100kb sized file into it:

import codecs
import re

outfile = codecs.open("outfile.txt", "w", "utf-8-sig")
inputfile = codecs.open("infile.txt", "r", "utf-8-sig")
filecontents = inputfile.read()
for result in re.findall(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK.*?\n(.*?)END\sOF\THE\sPROJECT\sGUTENBERG\sEBOOK', filecontents, re.DOTALL):
    outfile.write(result)
outfile.close()

When I use this regex operation on my larger file however, it will not do anything, the program just hangs. I tested it overnight to see if it was just slow and even after around 8 hours the program was still stuck.

I am very sure that the source of the problem is the (.*?) part of the regex, in combination with re.DOTALL. When I use a similar regex on smaller distances, the script will run fine and fast. My question now is: why is this just freezing up everything? I know the texts between the delimiters are not small, but a 50mb file shouldn't be too much to handle, right? Am I maybe missing a more efficient solution?

Thanks in advance.

604

asked Mar 28 '12 00:03

sirio0816

1 Answers

You are correct in thinking that using the sequence .*, which appears more than once, is causing problems. The issue is that the solver is trying many possible combinations of .*, leading to a result known as catastrophic backtracking.

The usual solution is to replace the . with a character class that is much more specific, usually the production that you are trying to terminate the first .* with. Something like:

`[^\n]*(.*)`

so that the capturing group can only match from the first newline to the end. Another option is to recognize that a regular expression solution may not be the best approach, and to use either a context free expression (such as pyparsing), or by first breaking up the input into smaller, easier to digest chunks (for example, with corpus.split('\n'))

155

answered Oct 01 '22 19:10

SingleNegationElimination

Related questions
                            
                                Using custom formatter classes with Python's logging.config module
                            
                                Python inspect.getmembers does not return the actual function when used with decorators
                            
                                finding n largest differences between two lists
                            
                                Delete user when deleting UserProfile
                            
                                Preformat to currency and two decimal places in python using xlwt for excel
                            
                                Is there a way to incorporate python code into moinmoin pages?
                            
                                how does list(string) split the string to an array of characters in python?
                            
                                pickling class method
                            
                                How to select parent based on the child in lxml?
                            
                                How does web2py query expressions work?
                            
                                Web Scraping data using python?
                            
                                converting a latex code to mathml or svg code in python
                            
                                Place of the import statement
                            
                                How to tail a file with gevent
                            
                                find command with exec in python subprocess gives error
                            
                                Render equation to .png file using Python
                            
                                Python Selenium (waiting for frame, element lookups)
                            
                                Tokenizing large (>70MB) TXT file using Python NLTK. Concatenation & write data to stream errors
                            
                                How does Richardson–Lucy algorithm work? Code example?
                            
                                Finding minimum, maximum and average values for nested lists?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

re.findall regex hangs or very slow

Tags:

python

regex

sirio0816

People also ask

1 Answers

SingleNegationElimination

Recent Activity

Donate For Us