
Multiple negative lookbehind assertions in python regex?

Tags: python, regex

I'm new to programming, sorry if this seems trivial: I have a text that I'm trying to split into individual sentences using regular expressions. With the .split method I search for a dot followed by a space and a capital letter, like this:

"\. A-Z"

However, I need to refine this rule in the following way: the . (dot) may not be preceded by either Abs or S. And if it is followed by a capital letter (A-Z), it should still not match if that letter starts a month name, like January | February | March.

I tried implementing the first half, but even this did not work. My code was:

"( (?<!Abs)\. A-Z) | (?<!S)\. A-Z) ) "

Elip asked Oct 02 '12 11:10



4 Answers

First, I think you may want to replace the space with \s+, or with \s if it really is exactly one whitespace character (you often find double spaces in English text).

Second, to match an uppercase letter you have to use the character class [A-Z]; a bare A-Z only matches the literal text "A-Z". (Remember, though, that there may be uppercase letters other than A-Z.)
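To illustrate the remark about uppercase letters outside A-Z, here is a rough sketch. The sample sentence and the helper name split_before_uppercase are made up for this example; the helper checks the character after each candidate dot with str.isupper() instead of relying on [A-Z]. It is only one possible workaround, not the only way to handle non-ASCII capitals.

```python
import re

text = "Il est parti. Émile est resté. Anna aussi."

# ASCII-only class: the dot before "Émile" is not treated as a boundary.
ascii_only = re.split(r"\.\s+(?=[A-Z])", text)

def split_before_uppercase(s):
    """Split at '.' + whitespace when the next character is uppercase
    according to str.isupper(), which also covers letters such as 'É'."""
    parts, start = [], 0
    for m in re.finditer(r"\.\s+", s):
        if m.end() < len(s) and s[m.end()].isupper():
            parts.append(s[start:m.start()])
            start = m.end()
    parts.append(s[start:])
    return parts

unicode_aware = split_before_uppercase(text)

print(ascii_only)
print(unicode_aware)
```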

Additionally, I think I know why your attempt does not work. With |, the regular expression engine matches \. [A-Z] if it is not preceded by Abs, or if it is not preceded by S. The thing is that if the dot is preceded by S, it is not preceded by Abs, so the first alternative matches. If it is preceded by Abs, it is not preceded by S, so the second alternative matches. Either way one of the alternatives will match, since the text before the dot cannot end in both Abs and S at the same time.

The pattern for the first part of your question could be

(?<!Abs)(?<!S)(\. [A-Z])

or

(?<!Abs)(?<!S)(\.\s+[A-Z])

(with my suggestion)

This works because it avoids |. Without the alternation, the expression says "not preceded by Abs and not preceded by S". Only if both conditions hold does the matcher go on to try the rest of the pattern at that position.
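A quick way to see the difference is to run both forms against the same text. This is only a sketch: the sample string and variable names are invented here, and the alternation pattern is a cleaned-up rendering of the attempt in the question, not the asker's exact code.

```python
import re

text = "See Abs. Three for details. Then read S. Four. Done here."

# Alternation: each branch rules out only ONE prefix, so a dot after "Abs"
# still matches through the (?<!S) branch, and a dot after "S" still
# matches through the (?<!Abs) branch.
either = re.compile(r"(?:(?<!Abs)\.\s+[A-Z]|(?<!S)\.\s+[A-Z])")

# Chained lookbehinds: both conditions must hold at the same position.
both = re.compile(r"(?<!Abs)(?<!S)\.\s+[A-Z]")

either_hits = [m.group() for m in either.finditer(text)]
both_hits = [m.group() for m in both.finditer(text)]

print(either_hits)  # every dot followed by a capital, including Abs. and S.
print(both_hits)    # only the dots after "details" and "Four"
```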

To exclude the month names I came up with this regular expression:

(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]

The same arguments hold for the negative look ahead patterns.
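One thing to watch if this pattern is passed to re.split: the trailing [A-Z] is consumed by the match, and the capturing group (\.\s+) is returned as part of the result list. The sketch below uses a made-up sample string and shows one possible adjustment, which drops the group and turns [A-Z] into a lookahead; treat it as an illustration rather than the definitive form.

```python
import re

text = "Due on 2. March as planned. See Abs. Four. Then stop."

# Pattern as written above: the capital letter is part of the match,
# and the group around the separator is captured.
consuming = r"(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]"

# Adjusted variant: no capturing group, capital letter only looked at.
lookahead_only = r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])"

consumed_parts = re.split(consuming, text)
clean_parts = re.split(lookahead_only, text)

print(consumed_parts)  # separators included, first letters missing
print(clean_parts)     # sentences kept intact
```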

hochl answered Oct 27 '22 07:10


I'm adding a short answer to the question in the title, since this is at the top of Google's search results:

The way to have multiple negative lookbehinds of different lengths is to chain them together like this:

"(?<!1)(?<!12)(?<!123)example"

This would match example, 2example and 3example, but not 1example, 12example or 123example.
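A small check of that claim, as a sketch (the list of sample strings is just the six cases named above). It also shows why chaining is needed in Python: as far as I know, the standard re module requires a lookbehind to be fixed-width, so putting alternatives of different lengths into a single lookbehind is rejected at compile time.

```python
import re

chained = re.compile(r"(?<!1)(?<!12)(?<!123)example")

samples = ["example", "2example", "3example",
           "1example", "12example", "123example"]
matched = [s for s in samples if chained.search(s)]
print(matched)

# A single lookbehind with alternatives of different lengths does not
# compile with the standard re module (look-behind must be fixed-width).
try:
    re.compile(r"(?<!1|12|123)example")
    single_lookbehind_ok = True
except re.error:
    single_lookbehind_ok = False
print(single_lookbehind_ok)
```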

Nathan Wailes answered Oct 27 '22 07:10


Use the NLTK Punkt tokenizer. It's probably more robust than using a regex.

>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... """
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.

root answered Oct 27 '22 06:10


Use nltk or similar tools as suggested by @root.

To answer your regex question:

import re
import sys

print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])",
               sys.stdin.read()))

Input

First. Second. January. Third. Abs. Forth. S. Fifth.
S. Sixth. ABs. Eighth

Output

['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth',
 'S. Sixth', 'ABs', 'Eighth']
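
If the lists of abbreviations and protected words grow, the same pattern can be assembled programmatically. The following is a minimal sketch under that assumption: build_splitter and its parameter names are hypothetical helpers invented for this example, and each abbreviation gets its own chained lookbehind so that every lookbehind stays fixed-width.

```python
import re

def build_splitter(abbreviations, no_split_before):
    """Compile a sentence-boundary pattern (hypothetical helper).

    abbreviations   -- strings that must NOT directly precede the dot
    no_split_before -- words that must NOT directly follow the whitespace
    """
    # One fixed-width negative lookbehind per abbreviation, chained.
    behind = "".join(f"(?<!{re.escape(a)})" for a in abbreviations)
    # A single negative lookahead may hold alternatives of any length.
    ahead = ""
    if no_split_before:
        ahead = "(?!" + "|".join(re.escape(w) for w in no_split_before) + ")"
    return re.compile(behind + r"\.\s+" + ahead + r"(?=[A-Z])")

splitter = build_splitter(["Abs", "S"], ["January", "February", "March"])
text = "First. Second. January. Third. Abs. Forth. S. Fifth."
parts = splitter.split(text)
print(parts)
```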

jfs answered Oct 27 '22 07:10