I iterate over the lines in a large number of downloaded text files and do a regex match on each line. Usually, the match takes less than a second. However, at times a match takes several minutes, sometimes the match does not finish at all and the code just hangs (waited an hour a couple of times, then gave up). Therefore, I need to introduce some kind of timeout and tell the regex match code in some way to stop after 10 seconds or so. I can live with the fact that I will lose the data the regex was supposed to return.
I tried the following (which of course is already 2 different, thread-based solutions shown in one code sample):
    from threading import Thread, Timer

    def timeout_handler():
        print('timeout_handler called')

    if __name__ == '__main__':
        timer_thread = Timer(8.0, timeout_handler)
        parse_thread = Thread(target=parse_data_files, args=(my_args,))
        timer_thread.start()
        parse_thread.start()
        parse_thread.join(12.0)
        print('do we ever get here ?')
but I get neither the `timeout_handler called` line nor the `do we ever get here ?` line in the output; the code is just stuck in `parse_data_files`.
Even worse, I can't even stop the program with CTRL-C; instead I need to look up the Python process number and kill that process. Some research showed that the Python developers are aware of regex C code running away: http://bugs.python.org/issue846388
I did achieve some success using signals:
    from signal import signal, alarm, SIGALRM

    signal(SIGALRM, timeout_handler)
    alarm(8)
    data_sets = parse_data_files(config(), data_provider)
    alarm(0)
This gets me the `timeout_handler called` line in the output - and I can still stop my script using CTRL-C. If I now modify the timeout_handler like this:
    class TimeoutException(Exception):
        pass

    def timeout_handler(signum, frame):
        raise TimeoutException()
and enclose the actual call to `re.match(...)` in a `try ... except TimeoutException` clause, the regex match actually does get interrupted. Unfortunately, this only works in my simple, single-threaded sandbox script I'm using to try out stuff. There are a few things wrong with this solution, the main one being that signal handlers can only be installed in, and delivered to, the main thread, so it won't combine with my threaded parsing code.
I have also considered doing the regex match in a separate process, but before I get into that, I thought I'd better check here if anyone has come across this problem before and could give me some hints on how to solve it.
the regex looks like this (well, one of them anyway, the problem occurs with other regexes, too; this is the simplest one):
    r'^(\d{5}), .+?, (\d{8}), (\d{4}), .+?, .+?,' + 37 * r' (.*?),' + r' (.*?)$'
sample data:
95756, "KURN ", 20110311, 2130, -34.00, 151.21, 260, 06.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -
As said, the regex usually performs ok - I can parse several hundreds of files with several hundreds of lines in less than a minute. That's when the files are complete, though - the code seems to hang with files that have incomplete lines, such as e.g.
95142, "YMGD ", 20110311, 1700, -12.06, 134.23, 310, 05.0, 25.8, 23.7, 1004.7, 20.6, 0.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999
I do also get cases where the regex seems to return right away and reports a non-match.
I have only quickly read through the article on catastrophic backtracking, but as far as I can tell so far, that's not the cause - I do not nest any repetition operators.
I'm on Mac OSX, so I can't use RegexBuddy to analyze my regex. I tried RegExhibit (which apparently uses a Perl RegEx engine internally) - and that runs away, too.
You make it non-greedy by using `.+?`. With that construct, the regex engine will, at every character it matches into the `.`, first attempt to match whatever comes after the `.+?`. This means a lazy quantifier does not prevent backtracking: if the rest of the pattern fails further along, the engine comes back and lets the `.` consume one more character, again and again. A `(?: ... )` group, by contrast, merely groups without capturing: it matches the enclosed text but does not store it in a numbered group for later retrieval.
Python's re.compile() function is used to compile a regular expression pattern provided as a string into a regex pattern object (re.Pattern). Later we can use this pattern object to search for a match inside different target strings using regex methods such as re.match() or re.search().
The 'r' at the start of the pattern string designates a python "raw" string which passes through backslashes without change which is very handy for regular expressions (Java needs this feature badly!).
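For instance, compiling once and reusing the object across many lines (the pattern prefix here is taken from the question's regex; `station_id` is just an illustrative name):

```python
import re

# Compile once, reuse for every line of every file.
station_id = re.compile(r'^(\d{5}), ')

m = station_id.match('95756, "KURN ", 20110311, 2130')
if m:
    print(m.group(1))   # the five-digit station id
```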
You are running into catastrophic backtracking; not because of nested quantifiers but because your quantified characters also can match the separators, and since there are a lot of them, you'll get exponential time in certain cases.
Aside from the fact that it looks more like a job for a CSV parser, try the following:
    r'^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,' + 37 * r' ([^,]+),' + r' ([^,]+)$'
By explicitly disallowing the comma to match between separators, you'll speed up the regex enormously.
If commas may be present inside quoted strings, for example, then just exchange `[^,]+` (in places where you'd expect this) with `(?:"[^"]*"|[^,]+)`.
To illustrate:
Using your regex against the first example, RegexBuddy reports a successful match after 793 steps of the regex engine. For the second (incomplete line) example, it reports a match failure after 1,000,000 steps of the regex engine (this is where RegexBuddy gives up; Python will keep on churning).
Using my regex, the successful match happens in 173 steps, the failure in 174.
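The difference is easy to reproduce in plain Python. The line below is built in the same 44-field shape as the sample data, and `bad` simulates a truncated line like the ones that hang:

```python
import re

pattern = re.compile(r'^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,'
                     + 37 * r' ([^,]+),' + r' ([^,]+)$')

# A complete line shaped like the sample data (44 comma-separated fields).
good = ('95756, "KURN ", 20110311, 2130, -34.00, 151.21,'
        + 37 * ' -9999,' + ' -9999')
bad = good[:80]            # incomplete line, like the hanging files

m = pattern.match(good)    # succeeds almost immediately
print(pattern.match(bad))  # fails almost immediately: None
```

Because `[^,]+` can never run past a separator, a truncated line fails in linear time instead of triggering exponential backtracking.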
You can't do it with threads. Go ahead with your idea of doing the match in a separate process.
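A sketch of that idea with the stdlib `multiprocessing` module (the helper names are made up; on platforms without `fork`, the calling code must be guarded by `if __name__ == '__main__':`):

```python
import re
from multiprocessing import Pool, TimeoutError

def _match_groups(args):
    # Runs in the worker process; must be a module-level function
    # so it can be pickled.
    pattern, line = args
    m = re.match(pattern, line)
    return m.groups() if m else None

def safe_match(pattern, line, timeout=10):
    pool = Pool(processes=1)
    try:
        return pool.apply_async(_match_groups, ((pattern, line),)).get(timeout)
    except TimeoutError:
        return None            # the data for this line is lost
    finally:
        pool.terminate()       # kills the worker even mid-backtracking
        pool.join()
```

Unlike a thread, a worker process can be terminated from outside even while it is stuck inside the regex C code. Spawning one process per line is slow, though; for real use you would keep a pool alive and recycle it only when a match times out.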
Instead of trying to solve the regexp hangup issue with timeouts, maybe it would be worthwhile to consider a completely different kind of approach. If your data really is just comma-separated values, you should get much better performance with the `csv` module or just using `line.split(",")`.
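A quick sketch against the sample line (assuming `skipinitialspace=True` to absorb the blank after each comma, so the quoted station name is still recognized):

```python
import csv

line = '95142, "YMGD ", 20110311, 1700, -12.06, 134.23, 310, 05.0, 25.8'
# csv.reader accepts any iterable of lines, so a one-element list works.
fields = next(csv.reader([line], skipinitialspace=True))
print(fields[:3])   # ['95142', 'YMGD ', '20110311']
```

Parsing this way takes time proportional to the line length, so incomplete lines simply yield fewer fields instead of hanging; the field count can then be validated explicitly.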
Threading in Python is a weird beast. The Global Interpreter Lock is essentially one big Lock around the interpreter, that means only one thread at a time gets to execute within the interpreter.
Thread scheduling is delegated to the OS. Python essentially signals the OS that another thread may take the lock after a certain number of 'instructions'. So if Python is busy due to a run-away regular expression, it never gets the chance to signal the OS that it may try to take the lock for another thread. Hence the reason for using signals; they are the only way to interrupt.
I'm with Nosklo, go ahead and use separate processes. Or, try to rewrite the regular expression so that it doesn't run away. See the problems associated with backtracking. This may or may not be the cause for the poor regex performance, and changing your regex may not be possible. But if this is the cause and it can be changed, you'll save yourself a whole lot of headache by avoiding multiple processes.