What is the fastest performing regular expression that does not match any string? It may seem like a useless thing, but consider a program that takes a mandatory regex as a filter for instance (this is actually my scenario). I've tried a few and found b(?<!b)
to be the best performer given that b
occurs rarely in the input.
Here is a python code I wrote to test different patterns for their speed:
#!/usr/bin/env python
import re
import time
tests = [
r'a\A',
r'b\A',
r'a^',
r'b^',
r'[^\s\S]',
r'^(?<=a)',
r'^(?<=b)',
r'a(?<!a)',
r'b(?<!b)',
r'\Za',
r'\Zb',
r'$a',
r'$b'
]
timing = []
text = 'a' * 50000000
for t in tests:
pat = re.compile(t)
start = time.time()
pat.search(text)
dur = time.time() - start
timing.append((t, dur))
timing.sort(key=lambda x: x[1])
print('%-30s %s' % ('Pattern', 'Time'))
for t, dur in timing:
print('%-30s %0.3f' % (t, dur))
On my machine, I get the following times:
Pattern Time
b(?<!b) 0.043
b\A 0.043
b^ 0.043
$a 0.382
$b 0.382
^(?<=a) 0.395
\Za 0.395
\Zb 0.395
^(?<=b) 0.414
a\A 0.437
a^ 0.440
a(?<!a) 0.796
[^\s\S] 1.469
update: added benchmark for some of suggested regexes.
A single character is a valid regular expression. A single character that is not "magic" matches itself. If you can identify a single character that will never, ever appear in your text, you could make a pattern from that.
How about ASCII NUL, character 0?
I stuck in one more string in your test program, the string: '\0'
It was about as fast as your best pattern: b(?<!b)
Okay, you already have a character after the end of the string. How about a character before the start of the string? That's impossible: 'x^'
Aha! That's faster than checking for a character after end of string. But it's about as fast as your best pattern.
I suggest replacing the b
with an ASCII NUL and calling it good. When I tried that pattern: \0(?<!\0)
It wins by a tiny fraction. But really, on my computer, all the ones discussed above are so close together that there isn't much to distinguish them.
Results:
Pattern Time
\0(?<!\0) 0.098
\0 0.099
x^ 0.099
b(?<!b) 0.099
^(?<=x) 1.416
$b 1.446
$a 1.447
\Za 1.462
\Zb 1.465
[^\s\S] 2.280
a(?<!a) 2.843
That was fun. Thanks for posting the question.
EDIT: Ah hah! I rewrote the program to test with real input data, and got a different result.
I downloaded "The Complete Works of William Shakespeare" from Project Gutenberg as a text file. (Weird, it gave an error on wget
but let my browser get it... some sort of measure to protect against automated copying?) URL: http://www.gutenberg.org/cache/epub/100/pg100.txt
Here are the results, followed by the modified program as I ran it.
Pattern Time
\0(?<!\0) 0.110
\0 0.118
x^ 0.119
b(?<!b) 0.143
a(?<!a) 0.275
^(?<=x) 1.577
$b 1.605
$a 1.611
\Za 1.634
\Zb 1.634
[^\s\S] 2.441
So yeah I'm definitely going with that first one.
#!/usr/bin/env python
import re
import time
tests = [
r'x^',
r'\0',
r'[^\s\S]',
r'^(?<=x)',
r'a(?<!a)',
r'b(?<!b)',
r'\0(?<!\0)',
r'\Za',
r'\Zb',
r'$a',
r'$b'
]
timing = []
#text = 'a' * 50000000
text = open("/tmp/pg100.txt").read()
text = text * 10
for t in tests:
pat = re.compile(t)
start = time.time()
pat.search(text)
dur = time.time() - start
timing.append((t, dur))
timing.sort(key=lambda x: x[1])
print('%-30s %s' % ('Pattern', 'Time'))
for t, dur in timing:
print('%-30s %0.3f' % (t, dur))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With