Python - RegEx for splitting text into sentences (sentence-tokenizing) [duplicate]

I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of a sentence, and not at decimals, abbreviations, titles in a name, or a .com inside the sentence. This is my attempt at a regex, which doesn't work.

import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

for stuff in sentences:
    print(stuff)

Example output of what it should look like:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.

741

asked Sep 09 '14 01:09

user3590149

2 Answers

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s

Try this: split your string on this pattern. You can also check the demo.

http://regex101.com/r/nG1gU7/27
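For readers who want to try the pattern directly in Python, here is a minimal sketch that passes it to re.split on the text from the question. The names SENTENCE_BOUNDARY and sentences are my own, and the text is written without its trailing newline so that no empty string is produced at the end.

```python
import re

# The pattern from the answer above, unchanged.
# (?<!\w\.\w.)      do not split after tokens shaped like "i.e."
# (?<![A-Z][a-z]\.) do not split after two-letter titles such as "Mr." or "Jr."
# (?<=\.|\?)        only split right after a period or question mark
# \s                the whitespace character that is consumed by the split
SENTENCE_BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")

text = (
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. "
    "Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... "
    "Well, with a probability of .9 it isn't."
)

sentences = SENTENCE_BOUNDARY.split(text)

for sentence in sentences:
    print(sentence)
```

On this input the split gives the five sentences the question asks for. The pattern is narrow, though: it does not treat "!" as a terminator, and a longer title such as "Mrs." is not protected by the second lookbehind, so "Mrs. Smith" would be split after the title.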

168

answered Sep 23 '22 15:09

vks

OK, so sentence tokenizers are something I have looked at in some detail, using regexes, NLTK, CoreNLP and spaCy. You end up writing your own, and how you write it depends on the application. This stuff is tricky and valuable, and people don't just give their tokenizer code away. (Ultimately, tokenization is not a deterministic procedure; it's probabilistic, and it also depends very heavily on your corpus or domain, e.g. legal/financial documents vs. social-media posts vs. Yelp reviews vs. biomedical papers...)

In general you can't rely on one single Great White infallible regex; you have to write a function which uses several regexes (both positive and negative), a dictionary of abbreviations, and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English.

To illustrate how easily this can get seriously complicated, let's try to write that functional spec for a deterministic tokenizer, just to decide whether a single or multiple period ('.'/'...') indicates end-of-sentence or something else:

function isEndOfSentence(leftContext, rightContext)

Return False for decimals inside numbers or currency, e.g. 1.23, $1.23, "That's just my $.02". Consider also section references like 1.2.A.3.a, European date formats like 09.07.2014, IP addresses like 192.168.1.1, MAC addresses...
Return False (and don't tokenize into individual letters) for known abbreviations, e.g. "U.S. stocks are falling"; this requires a dictionary of known abbreviations. Anything outside that dictionary you will get wrong, unless you add code to detect unknown abbreviations like A.B.C. and add them to a list.
Ellipses '...' at the ends of sentences are terminal, but in the middle of sentences they are not. This is not as easy as you might think: you need to look at the left context and the right context, specifically whether the RHS is capitalized, and again consider capitalized words like 'I' and abbreviations. Here's an example showing the ambiguity: She asked me to stay... I left an hour later. (Was that one sentence or two? Impossible to determine.)
You may also want to write a few patterns to detect and reject miscellaneous non-sentence-ending uses of punctuation: emoticons :-), ASCII art, spaced ellipses . . . and other stuff, especially on Twitter. (Making that adaptive is even harder.) How do we tell if @midnight is a Twitter user, the show on Comedy Central, text shorthand, or simply unwanted/junk/typo punctuation? Seriously non-trivial.
After you handle all those negative cases, you could arbitrarily say that any isolated period followed by whitespace is likely to be an end of sentence. (Ultimately, if you really want to buy extra accuracy, you end up writing your own probabilistic sentence tokenizer which uses weights, and training it on a specific corpus, e.g. legal texts, broadcast media, StackOverflow, Twitter, forum comments etc.) Then you have to manually review exemplars and training errors. See the Manning and Jurafsky book or Coursera course [a]. Ultimately you get as much correctness as you are prepared to pay for.
All of the above is clearly specific to the English language and its abbreviations, and to US number/time/date formats. If you want to make it country- and language-independent, that's a bigger proposition: you'll need corpora, native speakers to label and QA it all, etc.
All of the above is still only ASCII, which is practically speaking only about 95 printable characters. Allow the input to be Unicode, and things get harder still (and the training set necessarily must be either much bigger or much sparser).
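A few of the rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not a complete tokenizer: KNOWN_ABBREVIATIONS is a deliberately tiny made-up dictionary, split_sentences is a hypothetical helper added to drive the function, and the ellipsis rule simply picks "capitalized next word means new sentence", which is exactly the guess the ambiguous example shows can be wrong.

```python
import re
from typing import List

# Hypothetical, deliberately tiny abbreviation dictionary (an assumption for
# this sketch); a usable one would be far larger and domain-specific.
KNOWN_ABBREVIATIONS = {"mr.", "mrs.", "dr.", "jr.", "i.e.", "e.g.", "u.s."}


def is_end_of_sentence(left_context: str, right_context: str) -> bool:
    """Decide whether the period ending left_context closes a sentence.

    left_context: text up to and including the '.' being examined.
    right_context: everything after that '.'.
    """
    # A period glued to a following letter or digit sits inside a number,
    # domain name, section reference, IP address and so on.
    if re.match(r"\w", right_context):
        return False

    # Known abbreviations do not end a sentence (wrong when one really does).
    words = left_context.split()
    last_token = words[-1].lstrip("\"'([").lower() if words else ""
    if last_token in KNOWN_ABBREVIATIONS:
        return False

    next_word = right_context.lstrip()

    # An ellipsis counts as terminal only if the next word is capitalized.
    # This is a guess: "She asked me to stay... I left" stays ambiguous.
    if left_context.endswith("..."):
        return next_word == "" or next_word[0].isupper()

    # Default: an isolated period followed by whitespace or end of text.
    return next_word == "" or right_context[0].isspace()


def split_sentences(text: str) -> List[str]:
    """Hypothetical driver: cut text wherever a terminator is accepted."""
    sentences, start = [], 0
    for match in re.finditer(r"[.?!]", text):
        end = match.end()
        left, right = text[:end], text[end:]
        if match.group() == ".":
            terminal = is_end_of_sentence(left, right)
        else:
            # Assumption: '?' and '!' end a sentence before whitespace or EOF.
            terminal = right == "" or right[0].isspace()
        if terminal:
            sentences.append(text[start:end].strip())
            start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Calling split_sentences on the text from the question yields the five expected sentences, but only because every abbreviation in it happens to be in the small dictionary.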

In the simple (deterministic) case, function isEndOfSentence(leftContext, rightContext) would return a boolean, but in the more general sense it's probabilistic: it returns a float 0.0-1.0 (the confidence level that that particular '.' is a sentence end).
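To make the float-returning variant concrete, here is a rough Python sketch. The function name and every weight in it are invented for illustration; in a real probabilistic tokenizer the weights would be learned from a labelled corpus, not hand-picked like this.

```python
import re


def end_of_sentence_confidence(left_context: str, right_context: str) -> float:
    """Return a score in [0.0, 1.0] that the final '.' ends a sentence.

    All weights are made-up illustration values, not trained ones.
    """
    # A period glued to a following letter or digit is inside a token.
    if re.match(r"\w", right_context):
        return 0.0

    score = 0.5  # start undecided

    # Evidence from the right: a capitalized next word (or end of text)
    # suggests a new sentence; a lowercase one suggests continuation.
    next_word = right_context.lstrip()
    if next_word == "" or next_word[0].isupper():
        score += 0.3
    else:
        score -= 0.3

    # Evidence from the left: a short capitalized token such as "Mr." or
    # "Jr." looks like a title or initial, not a sentence ending.
    if re.search(r"\b[A-Z][a-z]{0,2}\.$", left_context):
        score -= 0.4

    # Ellipses are less reliable terminators than a single period.
    if left_context.endswith("..."):
        score -= 0.1

    # Clamp to the valid range.
    return min(1.0, max(0.0, score))
```

A caller would compare the score against a threshold of its own choosing; the deterministic version is then just the special case where that comparison is baked in.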

References: [a] Coursera video: "Basic Text Processing 2-5 - Sentence Segmentation - Stanford NLP - Professor Dan Jurafsky & Chris Manning" [UPDATE: an unofficial version used to be on YouTube, was taken down]

answered Sep 24 '22 15:09

smci


Tags: python, regex, tokenize, nlp
