Multiple negative lookbehind assertions in python regex?

Q: Does Python support negative Lookbehind?

Yes. In a pattern such as (?<!\$)\d+, the (?<!\$) part is a negative lookbehind that rejects a match immediately preceded by a $ sign, and \d+ matches a number with one or more digits.

Q: What is negative Lookbehind regex?

With a negative lookbehind, the regex engine looks back from the current position and tries to match the lookbehind pattern against the text immediately before it. If that look back succeeds, the match at this position fails; otherwise matching continues.

Q: What is Lookbehind in regex?

A lookbehind matches a phrase only when it is preceded by user-specified text. A positive lookbehind is written like (?<=a)something and can be combined with any other regex construct. This pattern matches any "something" that is immediately preceded by "a".
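A minimal sketch of the positive lookbehind described above, assuming Python's built-in re module. The sample strings and variable names are made up for illustration.

```python
import re

# (?<=a) is a positive lookbehind: "something" must be directly preceded by "a".
pattern = re.compile(r"(?<=a)something")

preceded_by_a = pattern.search("asomething")
preceded_by_b = pattern.search("bsomething")

# The lookbehind text itself is not part of the match.
print(preceded_by_a.group())  # something
print(preceded_by_b)          # None
```

The "a" is checked but not consumed, which is why the match starts at index 1 and contains only "something".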

Q: Can I use Lookbehind regex?

The good news is that you can use lookbehind anywhere in the regex, not only at the start.

Tags:

python

regex

I'm new to programming, sorry if this seems trivial: I have a text that I'm trying to split into individual sentences using regular expressions. With the .split method I search for a dot followed by a capital letter like

"\. A-Z"

However, I need to refine this rule in the following way: the . (dot) may not be preceded by either Abs or S. And if it is followed by a capital letter (A-Z), it should still not match if that letter starts a month name, like January | February | March.

I tried implementing the first half, but even this did not work. My code was:

"( (?<!Abs)\. A-Z) | (?<!S)\. A-Z) ) "

asked Oct 02 '12 11:10

Elip

4 Answers

First, I think you may want to replace the space with \s+, or \s if it really is exactly one space (you often find double spaces in English text).

Second, to match an uppercase letter you have to use the character class [A-Z]; a bare A-Z only matches the literal text "A-Z" (and remember there may be uppercase letters other than A-Z ...).

Additionally, I think I know why this does not work. The regular expression engine will try to match \. [A-Z] if it is not preceded by Abs or not preceded by S. The thing is that if it is preceded by an S, it is not preceded by Abs, so the first alternative matches. If it is preceded by Abs, it is not preceded by S, so the second alternative matches. Either way one of the alternatives will match, since Abs and S are mutually exclusive.

The pattern for the first part of your question could be

(?<!Abs)(?<!S)(\. [A-Z])

or, with my \s+ suggestion applied,

(?<!Abs)(?<!S)(\.\s+[A-Z])

This works because the | is gone: without it, the expression says "not preceded by Abs and not preceded by S". Only if both are true will the pattern matcher go on to try the rest of the pattern and find your match.
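The difference between alternation and chaining can be checked directly. This is a minimal sketch, assuming Python's built-in re module; the sample text and the variable names are made up for illustration, and the alternation pattern is a cleaned-up version of the attempt in the question.

```python
import re

text = "See Abs. Three for details. Then S. Four follows. Next sentence."

# Alternation: a boundary is accepted if EITHER lookbehind passes,
# and at least one of them always does.
alternation = re.compile(r"(?<!Abs)\.\s+[A-Z]|(?<!S)\.\s+[A-Z]")

# Chaining: a boundary is accepted only if BOTH lookbehinds pass.
chained = re.compile(r"(?<!Abs)(?<!S)\.\s+[A-Z]")

alternation_hits = [m.group() for m in alternation.finditer(text)]
chained_hits = [m.group() for m in chained.finditer(text)]

print(alternation_hits)  # ['. T', '. T', '. F', '. N']
print(chained_hits)      # ['. T', '. N']
```

The alternation still reports the boundaries after "Abs" and "S", while the chained version skips both.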

To exclude the month names I came up with this regular expression:

(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]

The same arguments hold for the negative look ahead patterns.
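As a rough check of the month exclusion, here is a small sketch assuming Python's built-in re module. The sample text and the names boundary and hits are illustrative only, and the capturing group around \.\s+ is left out because it is not needed for this check.

```python
import re

# Both lookbehinds must pass, and the text after the whitespace
# must not start with one of the listed month names.
boundary = re.compile(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)[A-Z]")

text = "It rained. March was dry. Then S. Four came. April followed."

hits = [m.group() for m in boundary.finditer(text)]
print(hits)  # ['. T', '. A']
```

The dot before "March" and the dot after "S" are both skipped. "April" still counts as a sentence start because only three months are listed in the lookahead.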

answered Oct 27 '22 07:10

hochl

I'm adding a short answer to the question in the title, since this is at the top of Google's search results:

The way to have multiple differently-lengthed negative lookbehinds is to chain them together like this:

"(?<!1)(?<!12)(?<!123)example"

This would match the example in example, 2example and 3example, but not in 1example, 12example or 123example.
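The claim can be verified with a short sketch, assuming Python's built-in re module (the names samples and results are illustrative). It also shows why chaining is needed: at least in current CPython releases, re rejects a single lookbehind whose alternatives have different lengths.

```python
import re

chained = re.compile(r"(?<!1)(?<!12)(?<!123)example")

samples = ["example", "2example", "3example",
           "1example", "12example", "123example"]
results = {s: bool(chained.search(s)) for s in samples}
print(results)

# The re module requires each lookbehind to have a fixed width, so
# alternatives of different lengths inside one lookbehind fail to compile.
try:
    re.compile(r"(?<!1|12|123)example")
    single_lookbehind_ok = True
except re.error:
    single_lookbehind_ok = False
print(single_lookbehind_ok)
```

Each chained lookbehind has its own fixed width (1, 2 and 3 characters), which is what makes the chained form acceptable to re.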

answered Oct 27 '22 07:10

Nathan Wailes

Use nltk punkt tokenizer. It's ~~probably~~ more robust than using regex.

>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... """
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.

answered Oct 27 '22 06:10

root

Use nltk or similar tools as suggested by @root.

To answer your regex question:

import re
import sys

print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])",
               sys.stdin.read()))

Input

First. Second. January. Third. Abs. Forth. S. Fifth.
S. Sixth. ABs. Eighth

Output

['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth',
 'S. Sixth', 'ABs', 'Eighth']

answered Oct 27 '22 07:10

jfs
