I remember skimming the sentence segmentation section from the NLTK site a long time ago.
I use a crude text replacement of “period” “space” with “period” “manual line break” to achieve sentence segmentation, such as with a Microsoft Word replacement (.
-> .^p
) or a Chrome extension:
https://github.com/AhmadHassanAwan/Sentence-Segmentation
https://chrome.google.com/webstore/detail/sentence-segmenter/jfbhkblbhhigbgdnijncccdndhbflcha
This is instead of an NLP method like the Punkt tokenizer of NLTK.
I segment to help me more easily locate and reread sentences, which can sometimes help with reading comprehension.
What about independent clause boundary disambiguation, and independent clause segmentation? Are there any tools that attempt to do this?
Below is some example text. If an independent clause can be identified within a sentence, there’s a split. Starting from the end of a sentence, it moves left, and greedily splits:
E.g.
Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where
sentences begin and end.
Often, natural language processing tools
require their input to be divided into sentences for a number of reasons.
However, sentence boundary identification is challenging because punctuation
marks are often ambiguous.
For example, a period may
denote an abbreviation, decimal point, an ellipsis, or an email address - not the end of a sentence.
About 47% of the periods in the Wall Street Journal corpus
denote abbreviations.[1]
As well, question marks and exclamation marks may
appear in embedded quotations, emoticons, computer code, and slang.
Another approach is to automatically
learn a set of rules from a set of documents where the sentence
breaks are pre-marked.
Languages like Japanese and Chinese
have unambiguous sentence-ending markers.
The standard 'vanilla' approach to
locate the end of a sentence:
(a) If
it's a period,
it ends a sentence.
(b) If the preceding
token is on my hand-compiled list of abbreviations, then
it doesn't end a sentence.
(c) If the next
token is capitalized, then
it ends a sentence.
This
strategy gets about 95% of sentences correct.[2]
Solutions have been based on a maximum entropy model.[3]
The SATZ architecture uses a neural network to
disambiguate sentence boundaries and achieves 98.5% accuracy.
(I’m not sure if I split it properly.)
If there are no means to segment independent clauses, are there any search terms that I can use to further explore this topic?
Thanks.
To the best of my knowledge, there is no readily available tool to solve this exact problem. Usually, NLP systems do not get into the problem of identifying different types of sentences and clauses as defined by English grammar. There is one paper published in EMNLP which provides an algorithm which uses the SBAR
tag in parse trees to identify independent and dependent clauses in a sentence.
You should find section 3 of this paper useful. It talks about English language syntax in some details, but I don't think the entire paper is relevant to your question.
Note that they have used the Berkeley parser (demo available here), but you can obviously any other constituency parsing tool (e.g. the Stanford parser demo available here).
Chthonic Project gives some good information here:
Clause Extraction using Stanford parser
Part of the answer:
It is probably better if you primarily use the constituenty-based parse tree, and not the dependencies.
The clauses are indicated by the SBAR tag, which is a clause introduced by a (possibly empty) subordinating conjunction.
All you need to do is the following:
- Identify the non-root clausal nodes in the parse tree
- Remove (but retain separately) the subtrees rooted at these clausal nodes from the main tree.
- In the main tree (after removal of subtrees in step 2), remove any hanging prepositions, subordinating conjunctions and adverbs.
For a list of all clausal tags (and, in fact, all Penn Treebank tags), see this list: http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html
For an online parse-tree visualization, you may want to use the online Berkeley parser demo.
It helps a lot in forming a better intuition.
Here's the image generated for your example sentence:
I don't know any tools that do clause segmentation, but in rhetorical structure theory, there is a concept called "elementary discourse unit" which work in a similar way as a clause. They are sometimes, however, slightly smaller than clauses.
You may see the section 2.0 of this manual for more information about this concept:
https://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf
There are some software available online that can segment sentence into their elementary discourse unit , for instance:
http://alt.qcri.org/tools/discourse-parser/
and
https://github.com/jiyfeng/DPLP
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With