I'm new to programming, sorry if this seems trivial: I have a text that I'm trying to split into individual sentences using regular expressions. With the .split
method I search for a dot followed by a capital letter like
"\. A-Z"
However, I need to refine this rule in the following way: the . (dot) may not be preceded by either Abs or S. And if it is followed by a capital letter (A-Z), it should still not match if that letter starts a month name, like January | February | March.
I tried implementing the first half, but even this did not work. My code was:
"( (?<!Abs)\. A-Z) | (?<!S)\. A-Z) ) "
(?<!\$) is a negative lookbehind: it succeeds only where the preceding character is not a $ sign. \d+ matches a number with one or more digits.
With a negative lookbehind, the regex engine reaches a candidate position and then looks backwards, trying to match the given item immediately before that position. If this backward match succeeds, the overall match fails at that position; otherwise matching continues.
A lookbehind is used to match a phrase that is preceded by user-specified text. A positive lookbehind is written (?<=a)something and can be combined with any other regex construct. That pattern matches any "something" that is immediately preceded by an "a".
The good news is that you can use lookbehind anywhere in the regex, not only at the start.
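As a minimal sketch of the lookbehind behaviour described above, the following Python snippet compares a positive and a negative lookbehind on the same input and then uses a lookbehind in the middle of a pattern. The sample strings and variable names are made up for illustration.

```python
import re

text = "asomething bsomething something"

# Positive lookbehind: "something" only where the previous character is "a".
positive = [m.start() for m in re.finditer(r"(?<=a)something", text)]

# Negative lookbehind: "something" only where the previous character is not "a".
negative = [m.start() for m in re.finditer(r"(?<!a)something", text)]

# A lookbehind in the middle of a pattern: whole words that do not end in "ing".
not_ing = re.findall(r"\b\w+(?<!ing)\b", "running fast singing well")

print(positive)  # [1]
print(negative)  # [12, 22]
print(not_ing)   # ['fast', 'well']
```

The lookbehind itself consumes no characters, so the reported match positions are those of "something", not of the preceding letter.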
First, I think you may want to replace the space with \s+
, or \s
if it really is exactly one space (you often find double spaces in English text).
Second, to match an uppercase letter you have to use the character class [A-Z]; a bare A-Z will not work, because it matches the three literal characters A, - and Z (but remember there may be other uppercase letters than A-Z...).
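To illustrate both points, here is a small Python sketch that runs the original pattern and the corrected one over the same made-up sample text (note the double space after the first sentence).

```python
import re

text = "One sentence.  Another one. A-Z is literal here."

# Without brackets, "A-Z" matches only the literal characters A, -, Z,
# and the single space matches exactly one space.
literal = re.findall(r"\. A-Z", text)

# A character class plus \s+ matches any ASCII capital letter
# after one or more whitespace characters.
char_class = re.findall(r"\.\s+[A-Z]", text)

print(literal)     # ['. A-Z']
print(char_class)  # ['.  A', '. A']
```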
Additionally, I think I know why this does not work. The regular expression engine will try to match \. [A-Z] if it is not preceded by Abs or not preceded by S. The thing is that, if it is preceded by an S, it is not preceded by Abs, so the first alternative matches. If it is preceded by Abs, it is not preceded by S, so the second alternative matches. Either way, one of those alternatives will match, since Abs and S are mutually exclusive.
The pattern for the first part of your question could be
(?<!Abs)(?<!S)(\. [A-Z])
or
(?<!Abs)(?<!S)(\.\s+[A-Z])
(with my suggestion)
The key is to avoid |. Without it, the expression says "not preceded by Abs and not preceded by S". Only if both conditions are true will the pattern matcher go on to match the dot and the capital letter.
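The difference between "or" and "and" can be checked with a short Python sketch. The names either and both and the sample strings are illustrative only; the either pattern is a cleaned-up form of the alternation idea from the question, not the asker's exact code.

```python
import re

# Alternation of two negative lookbehinds: succeeds if at least one holds.
either = re.compile(r"(?:(?<!Abs)|(?<!S))\.\s+[A-Z]")

# Two lookbehinds in sequence: both must hold at the same position.
both = re.compile(r"(?<!Abs)(?<!S)\.\s+[A-Z]")

samples = ["Abs. Next", "S. Next", "end. Next"]

either_hits = [bool(either.search(s)) for s in samples]
both_hits = [bool(both.search(s)) for s in samples]

print(either_hits)  # [True, True, True]
print(both_hits)    # [False, False, True]
```

The alternation never rejects anything here, because a dot preceded by Abs is not preceded by S and vice versa.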
To exclude the month names I came up with this regular expression:
(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]
The same arguments hold for the negative look ahead patterns.
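A quick way to see which boundaries this pattern accepts is to list its matches with re.finditer. The sample text below is invented for the purpose of this sketch.

```python
import re

boundary = re.compile(
    r"(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]"
)

text = "We met on 3. March in Berlin. Then we left. See Abs. Two."

# Each match covers the dot, the whitespace and the capital letter that follows.
matches = [m.group() for m in boundary.finditer(text)]

print(matches)  # ['. T', '. S']
```

The dot before March and the dot after Abs are skipped. Note that the lookahead only checks a prefix, so a word such as "Marching" would be excluded as well, and that this pattern consumes the capital letter, which matters if it is passed to a split function.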
I'm adding a short answer to the question in the title, since this is at the top of Google's search results:
The way to have multiple negative lookbehinds of different lengths is to chain them together like this:
"(?<!1)(?<!12)(?<!123)example"
This would match example, 2example and 3example, but not 1example, 12example or 123example.
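The claim can be verified with a few lines of Python; the results dictionary is just a convenience for this sketch.

```python
import re

pattern = re.compile(r"(?<!1)(?<!12)(?<!123)example")

candidates = ["example", "2example", "3example",
              "1example", "12example", "123example"]

# Map each candidate string to whether the pattern is found in it.
results = {s: bool(pattern.search(s)) for s in candidates}

for s, found in results.items():
    print(s, found)
```

The first three strings report True and the last three False. In Python's re module, a single lookbehind with alternatives of different widths, such as (?<!1|12|123), is rejected with a "look-behind requires fixed-width pattern" error, which is one reason for chaining.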
Use the NLTK Punkt tokenizer. It's probably more robust than using a regex.
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
Use nltk or similar tools as suggested by @root.
To answer your regex question:
import re
import sys
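# Split after a dot that is not preceded by Abs or S and is followed by a
# capital letter that does not start one of the listed month names.
print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])",
               sys.stdin.read()))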
Input:
First. Second. January. Third. Abs. Forth. S. Fifth.
S. Sixth. ABs. Eighth
Output:
['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth',
'S. Sixth', 'ABs', 'Eighth']