Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest regular expression that does not match any string

Tags:

python

regex

What is the fastest performing regular expression that does not match any string? It may seem like a useless thing, but consider a program that takes a mandatory regex as a filter for instance (this is actually my scenario). I've tried a few and found b(?<!b) to be the best performer given that b occurs rarely in the input.

Here is a python code I wrote to test different patterns for their speed:

#!/usr/bin/env python

import re
import time

tests = [
  r'a\A',
  r'b\A',
  r'a^',
  r'b^',
  r'[^\s\S]',
  r'^(?<=a)',
  r'^(?<=b)',
  r'a(?<!a)',
  r'b(?<!b)',
  r'\Za',
  r'\Zb',
  r'$a',
  r'$b'
]
timing = []
text = 'a' * 50000000

for t in tests:
  pat = re.compile(t)
  start = time.time()
  pat.search(text)
  dur = time.time() - start
  timing.append((t, dur))

timing.sort(key=lambda x: x[1])
print('%-30s %s' % ('Pattern', 'Time'))
for t, dur in timing:
  print('%-30s %0.3f' % (t, dur))

On my machine, I get the following times:

Pattern                        Time
b(?<!b)                        0.043
b\A                            0.043
b^                             0.043
$a                             0.382
$b                             0.382
^(?<=a)                        0.395
\Za                            0.395
\Zb                            0.395
^(?<=b)                        0.414
a\A                            0.437
a^                             0.440
a(?<!a)                        0.796
[^\s\S]                        1.469

update: added benchmark for some of suggested regexes.

like image 335
Mansour Avatar asked Oct 21 '22 18:10

Mansour


1 Answers

A single character is a valid regular expression. A single character that is not "magic" matches itself. If you can identify a single character that will never, ever appear in your text, you could make a pattern from that.

How about ASCII NUL, character 0?

I stuck in one more string in your test program, the string: '\0'

It was about as fast as your best pattern: b(?<!b)

Okay, you already have a character after the end of the string. How about a character before the start of the string? That's impossible: 'x^'

Aha! That's faster than checking for a character after end of string. But it's about as fast as your best pattern.

I suggest replacing the b with an ASCII NUL and calling it good. When I tried that pattern: \0(?<!\0)

It wins by a tiny fraction. But really, on my computer, all the ones discussed above are so close together that there isn't much to distinguish them.

Results:

Pattern                        Time
\0(?<!\0)                      0.098
\0                             0.099
x^                             0.099
b(?<!b)                        0.099
^(?<=x)                        1.416
$b                             1.446
$a                             1.447
\Za                            1.462
\Zb                            1.465
[^\s\S]                        2.280
a(?<!a)                        2.843

That was fun. Thanks for posting the question.

EDIT: Ah hah! I rewrote the program to test with real input data, and got a different result.

I downloaded "The Complete Works of William Shakespeare" from Project Gutenberg as a text file. (Weird, it gave an error on wget but let my browser get it... some sort of measure to protect against automated copying?) URL: http://www.gutenberg.org/cache/epub/100/pg100.txt

Here are the results, followed by the modified program as I ran it.

Pattern                        Time
\0(?<!\0)                      0.110
\0                             0.118
x^                             0.119
b(?<!b)                        0.143
a(?<!a)                        0.275
^(?<=x)                        1.577
$b                             1.605
$a                             1.611
\Za                            1.634
\Zb                            1.634
[^\s\S]                        2.441

So yeah I'm definitely going with that first one.

#!/usr/bin/env python

import re
import time

tests = [
  r'x^',
  r'\0',
  r'[^\s\S]',
  r'^(?<=x)',
  r'a(?<!a)',
  r'b(?<!b)',
  r'\0(?<!\0)',
  r'\Za',
  r'\Zb',
  r'$a',
  r'$b'
]
timing = []
#text = 'a' * 50000000
text = open("/tmp/pg100.txt").read()
text = text * 10

for t in tests:
  pat = re.compile(t)
  start = time.time()
  pat.search(text)
  dur = time.time() - start
  timing.append((t, dur))

timing.sort(key=lambda x: x[1])
print('%-30s %s' % ('Pattern', 'Time'))
for t, dur in timing:
  print('%-30s %0.3f' % (t, dur))
like image 173
steveha Avatar answered Oct 24 '22 17:10

steveha