
Handling conjunctions when splitting sentences using core-nlp's DocumentPreprocessor

I am trying to split a given text into sentences using CoreNLP's DocumentPreprocessor class.

Below is the code which I'm using.

List<String> splitSentencesList = new ArrayList<>();
Reader reader = new StringReader(inputText);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
for (List<HasWord> sentence : dp) {
    // Join the tokens, lower-case them, and strip the tokenized full stop " ."
    splitSentencesList.add(Sentence.listToString(sentence).toLowerCase().replace(" .", ""));
}

This works in most cases. But how do we handle conjunctions within a sentence?

E.g:

I like coffee and donuts for my breakfast.

Ideally, this should be further split into:

I like coffee for my breakfast.
I like donuts for my breakfast.

One option is to use a regex-based rule to split them further. Is there any built-in method to achieve this in CoreNLP?

Any pointers on this are appreciated.

Betafish asked Jul 18 '17 11:07



1 Answer

The simple answer is: you can't do that using the DocumentPreprocessor. It is designed to split text into sentences based on punctuation. There is no way to tell it to split (or rather duplicate) a sentence when a conjunction (like "and") is present.

Your idea to use a regex might just be the easiest way. You could also use CoreNLP's Dependency Parsing and check for a conjunction that connects two direct objects.

[Image: Dependency Parse]
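To make the dependency idea concrete, here is a minimal sketch of the duplication step only. It does not call CoreNLP: the Edge class, the 0-based token indices, and the hand-written edges are hypothetical stand-ins for what a dependency parser might return for the example sentence. The relation labels are also an assumption, since the direct-object label ("dobj" or "obj") and the attachment of "cc" can differ between parser versions, so the sketch accepts both. It only handles single-token conjuncts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ConjunctionDependencySplitter {

    /** Hypothetical stand-in for one parser edge; indices are 0-based token positions. */
    static final class Edge {
        final String relation;
        final int governor;
        final int dependent;

        Edge(String relation, int governor, int dependent) {
            this.relation = relation;
            this.governor = governor;
            this.dependent = dependent;
        }
    }

    /** Duplicates the sentence once per conjunct if a "conj" edge hangs off a direct object. */
    static List<String> split(String[] tokens, List<Edge> edges) {
        for (Edge conj : edges) {
            if (!conj.relation.equals("conj") || !isDirectObject(conj.governor, edges)) {
                continue;
            }
            int cc = findCoordinator(conj, edges);
            List<String> result = new ArrayList<>();
            result.add(joinWithout(tokens, conj.dependent, cc)); // keep the first conjunct
            result.add(joinWithout(tokens, conj.governor, cc));  // keep the second conjunct
            return result;
        }
        // No coordinated direct objects found: return the sentence unchanged.
        List<String> unchanged = new ArrayList<>();
        unchanged.add(joinWithout(tokens, -1, -1));
        return unchanged;
    }

    private static boolean isDirectObject(int token, List<Edge> edges) {
        for (Edge e : edges) {
            if (e.dependent == token
                    && (e.relation.equals("dobj") || e.relation.equals("obj"))) {
                return true;
            }
        }
        return false;
    }

    /** Returns the index of the "cc" token attached to either conjunct, or -1 if none. */
    private static int findCoordinator(Edge conj, List<Edge> edges) {
        for (Edge e : edges) {
            if (e.relation.equals("cc")
                    && (e.governor == conj.governor || e.governor == conj.dependent)) {
                return e.dependent;
            }
        }
        return -1;
    }

    /** Joins the tokens with single spaces, skipping the two given indices. */
    private static String joinWithout(String[] tokens, int skipA, int skipB) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i == skipA || i == skipB) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"I", "like", "coffee", "and", "donuts", "for", "my", "breakfast", "."};
        // Hand-written edges, assumed for illustration (not real parser output).
        List<Edge> edges = Arrays.asList(
                new Edge("nsubj", 1, 0),
                new Edge("dobj", 1, 2),
                new Edge("cc", 2, 3),
                new Edge("conj", 2, 4));
        for (String s : split(tokens, edges)) {
            System.out.println(s);
        }
    }
}
```

The output is space-joined tokens (so the full stop appears as " ."), which matches the tokenized form the question's code already post-processes. In a real pipeline the tokens and edges would come from the parser rather than being typed in by hand.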

For the sentence described above, a simple regex might just do the trick, while dependency parsing might come in handy if your sentences get more complex.
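As a rough illustration of the regex route, the sketch below uses only java.util.regex and assumes the simplest possible shape: exactly two single-word items joined by "and". The class name and the pattern are hypothetical, not anything CoreNLP provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConjunctionRegexSplitter {

    // Assumed shape: <prefix><word> and <word><rest>.
    // Groups: 1 = prefix, 2 = first item, 3 = second item, 4 = rest of the sentence.
    private static final Pattern COORDINATION =
            Pattern.compile("^(.*?)\\b(\\w+) and (\\w+)\\b(.*)$");

    /** Returns one sentence per coordinated item, or the input unchanged if nothing matches. */
    static List<String> split(String sentence) {
        List<String> result = new ArrayList<>();
        Matcher m = COORDINATION.matcher(sentence);
        if (m.matches()) {
            result.add(m.group(1) + m.group(2) + m.group(4));
            result.add(m.group(1) + m.group(3) + m.group(4));
        } else {
            result.add(sentence);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : split("I like coffee and donuts for my breakfast.")) {
            System.out.println(s);
        }
    }
}
```

The pattern has no idea what is being coordinated, so a sentence like "I ran and he walked" would be split incorrectly. That is the kind of case where the dependency-based check is the safer choice.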

Tobias Geiselmann answered Oct 25 '22 08:10