I'm attempting to determine the end of an English sentence (only approximately), by looking for "!", "?" or ".", but in the case of "." only when not preceeded by common abbreviations such as Mr. or Dr.
Is there any way to make the following Regular Expression even marginally more efficient? Perhaps by sorting the negative lookbehinds by descending size, or even by alphabetical order?
Here is the regular expression I have now:
((?<!St|Sgt|Rev|Ltd|Inc|Lt|Jr|Sr|Esq|Inst|Hon|Gen|Cpl|Comdr|Col|Corp|Mr|Dr|Gov|Mrs|Ms|[A-Z]|Assn|Capt)(\.)|(!)|(\?))(\s*$|\s+([_$#]|[A-Z][^.]))
The Problem:
The site at http://regex.powertoy.org/ reports: "7 matches 21044 probes (finished)" on even a simple paragraph... That outrageous size of the number 21044 seems intimately tied to the number of negative lookbehinds.
I'm seeking to reduce the computational complexity for the RegEx engine, as I have several GB of data to pass through it.
Is there any way to tigten this up? Is negative lookbehind truly the best/only way to accomplish this? Is there a way to do it as a lookahead instead? Is regex the wrong tool for this task?
EDIT: I am able to use either ActionScript's or PHP's RegEx engine.
EDIT: I cannot count on the number of spaces between sentences. Really!? sigh.
Please don't answer if you have no understanding of the inner workings of the RegEx engine, as pertaining to optimization.
Thanks in advance.
Perhaps try only doing the negative-lookbehind test after successfully matching . rather than at every character:
(?x: # Allow spacing and comments
(
(\.) # First match "."
(?<! # Then negative-look-behind for titles followed by "."
(?: St|Sgt|Rev|Ltd|Inc|Lt|Jr|Sr|Esq|Inst|Hon|Gen|Cpl|Comdr|Col|Corp|Mr|Dr|Gov|Mrs|Ms|[A-Z]|Assn|Capt)
\.
)
| (!)
| (\?)
)
( \s* $ | \s+ ( [_$#] | [A-Z] [^.] ))
)
This took the number of probes from 70000 down to 2500 or so on powertoy.org using that site's initial help text. (But powertoy didn't like my multi-line regex or the "x" flag or something so I had to squash the regex onto one line to test it).
You can go even further by using common prefixes in your list of titles:
(?x: # Allow spacing and comments
(
(\.) # First match "."
(?<! # Then negative-look-behind for titles followed by "."
(?:Assn|C(?:apt|ol|omdr|orp|pl)|Dr|Esq|G(?:en|ov)|Hon|I(?:nc|nst)|Jr|L(?:t|td)|M(?:[rs]|rs)|Rev|S(?:gt|[rt])|[A-Z])
\.
)
| (!)
| (\?)
)
( \s* $ | \s+ ( [_$#] | [A-Z] [^.] ))
)
This took the probe count down to about 2000.
EDIT:
Another trick that reduces the probe-count is to include a look-ahead for a capital letter at the start of the look-behind section (but I couldn't say for sure that it makes the regex more efficient) (Also including @Swiss's suggestion for word-boundary):
(?<! # Then negative-look-behind for titles followed by "."
\b (?= [A-Z] ) # But first ensure we have a capital letter before going on
(?:Assn|C(?:apt|ol|omdr|orp|pl)|Dr|Esq|G(?:en|ov)|Hon|I(?:nc|nst)|Jr|L(?:t|td)|M(?:[rs]|rs)|Rev|S(?:gt|[rt])|[A-Z])
\.
)
I would match the period first. Instead of this:
(?<!St|Sgt|Rev|Ltd|Inc|...|Capt)\.
...do this:
\.(?<!(?:St|Sgt|Rev|Ltd|Inc|...|Capt)\.)
The way you have it, the regex engine has to perform the lookbehind before it even determines that the next character is a period. When I make that change, it goes from 28,423 probes to 1,945. (I'm using the default text provided by the Powertoy site, since you didn't supply any.)
I would also combine the next two alternatives - (!)|(\?)
- into one, i.e., ([!?])
. That brings the probe count down to 1,344. And if you don't have to capture the individual parts of the match, I suggest you use non-capturing parentheses--i.e., (?:...)
instead of (...)
. Though it isn't reflected in the probe count, it does make the regex more efficient.
EDIT: Looking closer at you regex, I see the reason it wasn't finding any matches is the [A-Z]
alternative. Since the whole match is being done in case-insensitive mode, that alternative Trumps all the others. You should either remove the /i
modifier or override locally with the (?-i:...)
construct. And adding the \b
(word boundary), as @Swiss suggested, is a good idea, too. That leaves you with:
(?-i:\.(?<!\b(?:St|Sgt|Rev|Ltd|Inc|...|[A-Z]|Assn|Capt)\.)
...which, along with the [!?]
change, yields 6 matches in 1404 probes on the Regex Powertoy site.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With