I am looking for a sentence segmentor that can split compound sentences into simple sentences.
Example:
Input: Andrea is beautiful but she is strict.
(expected) Output: Andrea is beautiful. she is strict.
Input: i am andrea and i work for google.
(expected) Output: i am andrea. i work for google.
Input: Italy is my favorite country; i plan to spend two weeks there next year.
(expected) Output: Italy is my favorite country. i plan to spend two weeks there next year.
Any recommendations? I tried NLTK, spaCy, segtok, and nlp-compromise, but none of them work on these complex examples (I understand this is a difficult problem, so there may be no easy solution).
A compound sentence can be converted into a simple sentence by reducing one or more main clauses into a word or phrase. Study the following examples. Compound: He must run fast or he will not catch the train. Simple: He must run fast to catch the train.
Two independent clauses can be made into one compound sentence with a semicolon alone between them. The semicolon is stronger than the comma: it can separate two independent clauses by itself, whereas a comma cannot separate two independent clauses unless it is followed by a coordinating conjunction (FANBOYS).
There are four techniques used to join independent clauses in a compound sentence:
• a comma and a coordinating conjunction (for, and, nor, but, or, yet, so)
• a semicolon
• a semicolon and a transition word (therefore, however, hence, thus, etc.)
• a colon
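The four joining techniques above suggest a naive punctuation-based baseline. The following is a minimal sketch, not a robust solution: the function name split_at_joiners and the word lists are assumptions made for illustration, and the pattern will over-split in cases such as list commas ("apples, and oranges") or colons inside times.

```python
import re

# Word lists taken from the techniques listed above (illustrative, not exhaustive).
FANBOYS = "for|and|nor|but|or|yet|so"
TRANSITIONS = "therefore|however|hence|thus"

# Matches one "joiner" between two independent clauses:
#   - a semicolon, optionally followed by a transition word and a comma
#   - a comma followed by a coordinating conjunction
#   - a colon
JOINER = re.compile(
    rf"\s*(?:;\s*(?:(?:{TRANSITIONS})\b,?\s*)?"
    rf"|,\s*(?:{FANBOYS})\b\s*"
    rf"|:\s*)",
    re.IGNORECASE,
)

def split_at_joiners(sentence):
    """Split a sentence at clause joiners and end each part with a period."""
    body = sentence.strip().rstrip(".")
    return [part.strip() + "." for part in JOINER.split(body) if part.strip()]

print(split_at_joiners(
    "Italy is my favorite country; i plan to spend two weeks there next year."
))
```

Note that this only helps when the joiner is marked by punctuation. A sentence such as "Andrea is beautiful but she is strict." has no comma before "but", so this sketch leaves it unsplit; that case needs syntactic information.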
First of all, you need to better define what a "simple sentence" means to you from a linguistic (grammar) perspective. You can say, for example, that simple sentences are:
In short, you have many alternatives for defining this, and depending on your needs your "rule" should be more (or less) rigorous, because it will affect your algorithm design and (of course) your output.
I would suggest two basic steps.
Demo using your provided examples:
spaCy outputs these dependency trees for input1 and input2.
You may notice that using conj as a delimiter and merging the remaining subtrees returns the output you expected.
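The conj-as-delimiter idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact code behind the demo: the Tok tuple and the function names are hypothetical, and the parse for input1 is written out by hand to resemble what a dependency parser typically produces, so the example runs without a model. With spaCy, the same tuples could be built from token.text, token.head.i and token.dep_.

```python
from collections import namedtuple

# Hypothetical minimal token: text, index of its head token, dependency label.
# With spaCy: [Tok(t.text, t.head.i, t.dep_) for t in nlp(text)]
Tok = namedtuple("Tok", "text head dep")

def clause_head(tokens, i):
    """Walk up the tree until reaching a 'conj' token or the root."""
    while tokens[i].dep != "conj" and tokens[i].head != i:
        i = tokens[i].head
    return i

def split_on_conj(tokens):
    """Group tokens by their nearest conj/root ancestor, one group per clause."""
    groups = {}
    for i, tok in enumerate(tokens):
        if tok.dep in ("cc", "punct"):  # drop "but"/"and" and punctuation
            continue
        groups.setdefault(clause_head(tokens, i), []).append(tok.text)
    return [" ".join(words) + "." for _, words in sorted(groups.items())]

# Hand-written parse of input1 (an assumption; the root points to itself).
input1 = [
    Tok("Andrea", 1, "nsubj"), Tok("is", 1, "ROOT"), Tok("beautiful", 1, "acomp"),
    Tok("but", 1, "cc"), Tok("she", 5, "nsubj"), Tok("is", 1, "conj"),
    Tok("strict", 5, "acomp"), Tok(".", 1, "punct"),
]
print(split_on_conj(input1))
```

Be aware that conj also links coordinated nouns and adjectives ("apples and oranges"), so a real implementation would probably want to split only when the conj token is a verb with its own subject.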
You can do the same for your input3 after splitting on punctuation, as I mentioned above.
Finally, this is not a straightforward task. You may be fine with these simple rules, but if you need better results, first refine your definitions of what a "compound" or "simple" sentence means, and then have a look at more sophisticated algorithms using machine learning.
Although this is a very old question, it would be nice to know if this helps :)