Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

re.findall regex hangs or very slow

Tags:

python

regex

My input file is a large txt file with concatenated texts I got from an open text library. I am now trying to extract only the content of the book itself and filter out other stuff such as disclaimers etc. So I have around 100 documents in my large text file (around 50 mb).

I then have identified the start and end markers of the contents themselves, and decided to use a Python regex to find me everything between the start and end marker. To sum it up, the regex should look for the start marker, then match everything after it, and stop looking once the end marker is reached, then repeat these steps until the end of the file is reached.

The following code works flawlessly when I feed a small, 100kb sized file into it:

import codecs
import re

outfile = codecs.open("outfile.txt", "w", "utf-8-sig")
inputfile = codecs.open("infile.txt", "r", "utf-8-sig")
filecontents = inputfile.read()
for result in re.findall(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK.*?\n(.*?)END\sOF\THE\sPROJECT\sGUTENBERG\sEBOOK', filecontents, re.DOTALL):
    outfile.write(result)
outfile.close()

When I use this regex operation on my larger file however, it will not do anything, the program just hangs. I tested it overnight to see if it was just slow and even after around 8 hours the program was still stuck.

I am very sure that the source of the problem is the (.*?) part of the regex, in combination with re.DOTALL. When I use a similar regex on smaller distances, the script will run fine and fast. My question now is: why is this just freezing up everything? I know the texts between the delimiters are not small, but a 50mb file shouldn't be too much to handle, right? Am I maybe missing a more efficient solution?

Thanks in advance.

like image 604
sirio0816 Avatar asked Mar 28 '12 00:03

sirio0816


People also ask

What does regex Findall return?

Regex's findall() function is extremely useful as it returns a list of strings containing all matches. If the pattern is not found, re. findall() returns an empty list.

Is Re search slow?

Python regex: re.search() is extremely slow on large text files.

What is the difference between re search and re Findall?

Here you can see that, search() method is able to find a pattern from any position of the string. The re. findall() helps to get a list of all matching patterns. It searches from start or end of the given string.

Is re fast Python?

The re.search() and re. match() both are functions of re module in python. These functions are very efficient and fast for searching in strings. The function searches for some substring in a string and returns a match object if found, else it returns none.


1 Answers

You are correct in thinking that using the sequence .*, which appears more than once, is causing problems. The issue is that the solver is trying many possible combinations of .*, leading to a result known as catastrophic backtracking.

The usual solution is to replace the . with a character class that is much more specific, usually the production that you are trying to terminate the first .* with. Something like:

`[^\n]*(.*)`

so that the capturing group can only match from the first newline to the end. Another option is to recognize that a regular expression solution may not be the best approach, and to use either a context free expression (such as pyparsing), or by first breaking up the input into smaller, easier to digest chunks (for example, with corpus.split('\n'))

like image 155
SingleNegationElimination Avatar answered Oct 01 '22 19:10

SingleNegationElimination