
python: how to interrupt a regex match

I iterate over the lines in a large number of downloaded text files and do a regex match on each line. Usually, the match takes less than a second. However, at times a match takes several minutes, sometimes the match does not finish at all and the code just hangs (waited an hour a couple of times, then gave up). Therefore, I need to introduce some kind of timeout and tell the regex match code in some way to stop after 10 seconds or so. I can live with the fact that I will lose the data the regex was supposed to return.

I tried the following (which admittedly mixes two different thread-based approaches into one code sample):

from threading import Thread, Timer

def timeout_handler():
    print 'timeout_handler called'

if __name__ == '__main__':
    timer_thread = Timer(8.0, timeout_handler)
    # note the trailing comma: args must be a tuple
    parse_thread = Thread(target=parse_data_files, args=(my_args,))
    timer_thread.start()
    parse_thread.start()
    parse_thread.join(12.0)
    print 'do we ever get here ?'

but I get neither the timeout_handler called line nor the do we ever get here ? line in the output; the code is just stuck in parse_data_files.

Even worse, I can't even stop the program with CTRL-C; instead I need to look up the Python process number and kill that process. Some research showed that the Python developers are aware of regex C code running away: http://bugs.python.org/issue846388

I did achieve some success using signals:

from signal import signal, alarm, SIGALRM

signal(SIGALRM, timeout_handler)
alarm(8)
data_sets = parse_data_files(config(), data_provider)
alarm(0)

this gets me the timeout_handler called line in the output - and I can still stop my script using CTRL-C. If I now modify the timeout_handler like this:

class TimeoutException(Exception): 
    pass 

def timeout_handler(signum, frame):
    raise TimeoutException()

and enclose the actual call to re.match(...) in a try ... except TimeoutException clause, the regex match actually does get interrupted. Unfortunately, this only works in the simple, single-threaded sandbox script I'm using to try things out. There are a few things wrong with this solution:

  • the signal triggers only once; if there is more than one problematic line, I'm stuck on the second one
  • the timer starts counting as soon as alarm() is called, not when the actual parsing starts
  • because of the GIL, I have to do all the signal setup in the main thread, and signals are only received in the main thread; this clashes with the fact that multiple files are meant to be parsed simultaneously in separate threads. There is also only one global timeout exception raised, and I don't see how to know in which thread I need to react to it
  • I've read several times now that threads and signals do not mix very well
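For the single-threaded case, the first two points can be worked around by re-arming the alarm around each individual match. A sketch of that idea (the helper name match_with_alarm is my own; this is Unix-only, works only from the main thread, and assumes the interpreter checks for signals while the regex engine runs, which modern Python 3 does):

```python
import re
import signal

class TimeoutException(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutException()

signal.signal(signal.SIGALRM, timeout_handler)

def match_with_alarm(pattern, line, seconds=10):
    """Try to match one line; give up after `seconds` seconds."""
    signal.alarm(seconds)      # (re-)arm the alarm for this line
    try:
        return re.match(pattern, line)
    except TimeoutException:
        return None            # treat a runaway match as a non-match
    finally:
        signal.alarm(0)        # disarm so the next line gets a fresh timer
```

Calling this once per line means the timer covers only the parsing of that line, and a second problematic line re-arms the alarm instead of getting stuck.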

I have also considered doing the regex match in a separate process, but before I get into that, I thought I'd better check here if anyone has come across this problem before and could give me some hints on how to solve it.

Update

the regex looks like this (well, one of them anyway, the problem occurs with other regexes, too; this is the simplest one):

r'^(\d{5}), .+?, (\d{8}), (\d{4}), .+?, .+?,' + 37 * r' (.*?),' + r' (.*?)$'

sample data:

95756, "KURN ", 20110311, 2130, -34.00, 151.21, 260, 06.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -

As said, the regex usually performs fine - I can parse several hundred files with several hundred lines each in less than a minute. That's only when the files are complete, though - the code seems to hang on files with incomplete lines, such as:

95142, "YMGD ", 20110311, 1700, -12.06, 134.23, 310, 05.0, 25.8, 23.7, 1004.7, 20.6, 0.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999

I do also get cases where the regex seems to return right away and reports a non-match.

Update 2

I have only skimmed the article on catastrophic backtracking so far, but as far as I can tell, that's not the cause - I do not nest any repetition operators.

I'm on Mac OSX, so I can't use RegexBuddy to analyze my regex. I tried RegExhibit (which apparently uses a Perl RegEx engine internally) - and that runs away, too.

asked Mar 14 '11 by ssc



4 Answers

You are running into catastrophic backtracking; not because of nested quantifiers but because your quantified characters also can match the separators, and since there are a lot of them, you'll get exponential time in certain cases.

Aside from the fact that it looks more like a job for a CSV parser, try the following:

r'^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,' + 37 * r' ([^,]+),' + r' ([^,]+)$'

By explicitly disallowing the comma to match between separators, you'll speed up the regex enormously.

If commas may also appear inside quoted strings, then just exchange [^,]+ (in the places where you expect quoted fields) for

(?:"[^"]*"|[^,]+)

To illustrate:

Using your regex against the first example, RegexBuddy reports a successful match after 793 steps of the regex engine. For the second (incomplete) example, it reports a match failure after 1,000,000 steps (that is where RegexBuddy gives up; Python will keep on churning).

Using my regex, the successful match happens in 173 steps, the failure in 174.
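The difference is easy to reproduce in Python itself. A small sketch (the field values here are synthetic, constructed just to give a well-formed line with the right field count and an incomplete line with too few fields):

```python
import re

# The character-class version: [^,]+ cannot cross a comma, so a failing
# match fails fast instead of backtracking exponentially.
fast = re.compile(
    r'^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,'
    + 37 * r' ([^,]+),' + r' ([^,]+)$')

# A well-formed line: 6 header fields + 37 middle fields + 1 final field.
good = '95756, "KURN ", 20110311, 2130, -34.00, 151.21,' + ' x,' * 37 + ' end'

# An incomplete line: far fewer fields than the pattern requires.
bad = '95142, "YMGD ", 20110311, 1700, -12.06, 134.23,' + ' x,' * 10 + ' end'

m = fast.match(good)   # matches, with 41 capture groups
n = fast.match(bad)    # fails immediately instead of hanging
```

Running the original .+?/.*? pattern against the incomplete line is exactly what hangs, so it is deliberately left out of the sketch.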

answered Oct 06 '22 by Tim Pietzcker


You can't do it with threads. Go ahead with your idea of doing the match in a separate process.
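A minimal sketch of that idea using the stdlib multiprocessing module (the helper names are my own; the key point is that a child process can be terminated even while it is stuck inside the regex C code, which a thread cannot be):

```python
import multiprocessing
import re

def match_line(pattern, line):
    # runs in the worker process; may hang on a pathological line
    m = re.match(pattern, line)
    return m.groups() if m else None

def match_with_timeout(pattern, line, seconds=10):
    """Run re.match in a child process; kill it if it exceeds the timeout."""
    pool = multiprocessing.Pool(processes=1)
    try:
        return pool.apply_async(match_line, (pattern, line)).get(seconds)
    except multiprocessing.TimeoutError:
        return None       # runaway match: the data for this line is lost
    finally:
        pool.terminate()  # kills the worker even mid-match
        pool.join()
```

Spawning one process per line would be slow; in real code you would keep one worker pool alive and only replace it after a timeout.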

answered Oct 05 '22 by nosklo


Instead of trying to solve the regex hangup with timeouts, it may be worthwhile to consider a completely different approach: if your data really is just comma-separated values, you should get much better performance from the csv module, or even just line.split(",").
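For example, a sketch using (a shortened version of) the question's sample line; skipinitialspace swallows the blank after each comma, and quoted fields are handled without any regex at all:

```python
import csv

line = '95756, "KURN ", 20110311, 2130, -34.00, 151.21, 260, 06.0'

# csv.reader accepts any iterable of lines, e.g. an open file object,
# so in real code you would pass the file straight to csv.reader
row = next(csv.reader([line], skipinitialspace=True))
# row[0] is '95756', row[1] is 'KURN ' (quotes stripped)
```

Field-count validation (the incomplete-line case) then becomes a simple len(row) check instead of a 44-field regex.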

answered Oct 06 '22 by shang


Threading in Python is a weird beast. The Global Interpreter Lock is essentially one big lock around the interpreter, which means only one thread at a time gets to execute within the interpreter.

Thread scheduling is delegated to the OS. Python essentially signals the OS that another thread may take the lock after a certain number of 'instructions'. So if Python is busy with a runaway regular expression, it never gets the chance to tell the OS that another thread may try to take the lock. Hence the use of signals; they are the only way to interrupt.

I'm with nosklo: go ahead and use separate processes. Or try to rewrite the regular expression so that it doesn't run away; see the problems associated with backtracking. Backtracking may or may not be the cause of the poor performance, and changing your regex may not be possible. But if it is the cause and can be changed, you'll save yourself a whole lot of headache by avoiding multiple processes.

answered Oct 06 '22 by Josh Smeaton