Split compound sentences into simple sentences

Tags: nlp, chatbot

I am looking for a sentence segmenter that can split compound sentences into simple sentences.

Example:

Input: Andrea is beautiful but she is strict.
(expected) Output: Andrea is beautiful. she is strict.

Input: i am andrea and i work for google. 
(expected) Output: i am andrea. i work for google.

Input: Italy is my favorite country; i plan to spend two weeks there next year.
(expected) Output: Italy is my favorite country. i plan to spend two weeks there next year.

Any recommendations? I tried NLTK, spaCy, segtok, and nlp-compromise, but none of them handle these compound examples. (I understand this is a difficult problem, so there may be no easy solution.)

Anuj Gupta asked Jun 19 '17 09:06


1 Answer

First of all, you need to define more precisely what a "simple sentence" means to you from a linguistic (grammatical) perspective. You could say, for example, that simple sentences are:

  • text with no punctuation in the middle (periods, commas, colons, etc.)
  • sentences with a single verb. In that case you have to deal with hierarchy, where one sentence is "completed" by reusing another.
  • phrase-like text, where conjunctions can also act as delimiters.

In short, there are many alternative definitions, and depending on your needs your "rule" should be more or less rigorous, because it affects your algorithm design and, of course, your output.

I would suggest two basic steps:

  1. Split by punctuation, so you get "simpler sentences" (e.g. your input3).
  2. Feed each of those to a dependency parser such as spaCy, and use the dependency links as delimiters.
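
The first step could look roughly like the sketch below. The function name `split_by_punctuation` and the choice of delimiters are my own assumptions; commas are deliberately left out because they also separate list items, and splitting on periods will break abbreviations and decimals, so treat this as a starting point only.

```python
import re

def split_by_punctuation(text):
    """Split text on sentence-level punctuation (hypothetical helper).

    Assumption: ';', ':', '.', '!' and '?' always mark a clause boundary.
    This is too naive for abbreviations such as "e.g." or numbers like 3.5.
    """
    parts = re.split(r"[;:.!?]+", text)
    # Drop empty fragments and end every remaining piece with a period.
    return [part.strip() + "." for part in parts if part.strip()]

print(split_by_punctuation(
    "Italy is my favorite country; i plan to spend two weeks there next year."
))
```

On input3 this returns the two sentences from the expected output; input1 and input2 come back unchanged, which is why the second step is needed.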

Demo using your provided examples:
spaCy produces dependency trees for your input1 and input2. You may notice that by using conj as the delimiter and merging the remaining subtrees, you get the output you expected. You can do the same for input3 after splitting by punctuation, as mentioned above.
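
The conj idea can be sketched without depending on a particular parser. In the code below, `Tok` and `split_on_conj` are hypothetical names, and the parse of input1 is written by hand to mirror what a dependency parser would typically return; with spaCy the four fields would come from `token.text`, `token.head.i`, `token.dep_` and `token.pos_`. The check that the conj token is a verb is my own addition, so that noun coordination such as "apples and oranges" is not split.

```python
from typing import List, NamedTuple

class Tok(NamedTuple):
    text: str
    head: int  # index of the syntactic head; the root points to itself
    dep: str   # dependency label, e.g. "nsubj", "conj", "cc"
    pos: str   # coarse part-of-speech tag, e.g. "VERB", "AUX"

def split_on_conj(tokens: List[Tok]) -> List[str]:
    """Group tokens into clauses, starting a new clause at each verbal conj."""
    def starts_clause(i: int) -> bool:
        return tokens[i].dep == "conj" and tokens[i].pos in ("VERB", "AUX")

    def clause_root(i: int) -> int:
        # Walk up the tree until we reach a clause-starting conj or the root.
        while not starts_clause(i) and tokens[i].head != i:
            i = tokens[i].head
        return i

    clauses = {}
    for i, tok in enumerate(tokens):
        if tok.dep in ("cc", "punct"):
            continue  # drop the conjunction itself and the punctuation
        clauses.setdefault(clause_root(i), []).append(tok.text)
    return [" ".join(words) + "." for _, words in sorted(clauses.items())]

# Hand-written parse of "Andrea is beautiful but she is strict." (assumed, not parser output)
example = [
    Tok("Andrea", 1, "nsubj", "PROPN"), Tok("is", 1, "ROOT", "AUX"),
    Tok("beautiful", 1, "acomp", "ADJ"), Tok("but", 1, "cc", "CCONJ"),
    Tok("she", 5, "nsubj", "PRON"), Tok("is", 5, "conj", "AUX"),
    Tok("strict", 5, "acomp", "ADJ"), Tok(".", 1, "punct", "PUNCT"),
]
print(split_on_conj(example))
```

Each token is assigned to its nearest clause-starting ancestor, so the subtree under the conj verb becomes its own sentence and everything else stays with the root. A real parser may label or attach tokens differently, so check its actual output before relying on these rules.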

Finally, this is not a straightforward task. You may be fine with these simple rules, but if you need better results, first refine your definition of what a "compound" or "simple" sentence is, and then look at more sophisticated machine-learning approaches.

Although this is a very old question, it would be nice to know if this helps :)

Jason Angel answered Oct 08 '22 13:10