I am trying to split a given text into sentences using CoreNLP's DocumentPreprocessor class.
Below is the code I'm using.
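List<String> splitSentencesList = new ArrayList<>();
Reader reader = new StringReader(inputText);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
// DocumentPreprocessor iterates over the text one tokenized sentence at a time.
for (List<HasWord> sentence : dp) {
    // Join the tokens back into a string, lowercase it, and drop the " ." left by tokenization.
    splitSentencesList.add(Sentence.listToString(sentence).toLowerCase().replace(" .", ""));
}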
This works for most cases. But how do we handle conjunctions within a sentence?
E.g.:
I like coffee and donuts for my breakfast.
Ideally, this should be split further into:
I like coffee for my breakfast.
I like donuts for my breakfast.
One option is to write a regex-based rule to split them further. Is there any built-in method in CoreNLP to achieve this?
Any pointers on this are appreciated.
Splitting text into sentences can seem like an easy task: split the text on '. ' or '\n' characters. However, in free text this pattern is not consistent: authors can break a line in the middle of a sentence or use “.” in the wrong places.
To split sentences, first mark the clauses. Then make the sub-clauses independent by omitting subordinating linkers and inserting subjects or other words wherever necessary. Example: When I went to Delhi I met my friend who lives there. Clause 1: (When) I went to Delhi.
To perform sentence tokenization, we can use the re.split() function. Passing it a pattern splits the text into sentences wherever the pattern matches.
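Since the question is in Java, here is a rough Java counterpart of the re.split() idea, using String.split with a regular expression. This is only a minimal sketch: the class name RegexSentenceSplitter and the pattern are illustrative assumptions, not part of CoreNLP, and the rule will wrongly split after abbreviations such as "Dr." or "e.g.".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a CoreNLP class.
class RegexSentenceSplitter {

    // Splits on whitespace that directly follows '.', '!' or '?'.
    // The lookbehind keeps the punctuation attached to its sentence.
    static List<String> split(String text) {
        return Arrays.asList(text.trim().split("(?<=[.!?])\\s+"));
    }

    public static void main(String[] args) {
        for (String s : split("I like coffee. I like donuts! Do you?")) {
            System.out.println(s);
        }
    }
}
```

A rule like this is fine for clean, well-punctuated input; for free text, a trained splitter such as DocumentPreprocessor is usually the safer choice.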
Use sent_tokenize() from nltk.tokenize to split text into sentences.
The simple answer is: you can't do that using the DocumentPreprocessor. It is designed to split text into sentences based on punctuation. There is no way to tell it to split a sentence (or rather duplicate it) when a conjunction (like "and") is present.
Your idea to use a regex might just be the easiest way. You could also use CoreNLP's dependency parsing and check for a conjunction that connects two direct objects.
For the sentence described above, a simple regex might just do the trick, while dependency parsing might come in handy if your sentences get more complex.
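As a rough illustration of the regex route, the sketch below expands a single "X and Y" pair into two sentences. The class name ConjunctionSplitter and the pattern are assumptions made up for this example, not CoreNLP API. It only handles the first "word and word" pair and knows nothing about grammar, so a sentence like "Tom and Jerry are friends." would be expanded incorrectly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a CoreNLP class.
class ConjunctionSplitter {

    // Groups: (1) text before the pair, (2) first word, (3) second word,
    // (4) the rest of the sentence. The lazy prefix finds the first "X and Y".
    private static final Pattern SIMPLE_AND =
            Pattern.compile("^(.*?)\\b(\\w+) and (\\w+)\\b(.*)$");

    static List<String> expand(String sentence) {
        List<String> result = new ArrayList<>();
        Matcher m = SIMPLE_AND.matcher(sentence);
        if (m.matches()) {
            // Emit one copy of the sentence per conjoined word.
            result.add(m.group(1) + m.group(2) + m.group(4));
            result.add(m.group(1) + m.group(3) + m.group(4));
        } else {
            // No simple "X and Y" pair found: keep the sentence unchanged.
            result.add(sentence);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : expand("I like coffee and donuts for my breakfast.")) {
            System.out.println(s);
        }
    }
}
```

Each sentence produced by DocumentPreprocessor could be passed through expand() before being added to the result list. Once sentences contain conjoined verbs, multi-word objects, or several conjunctions, a rule like this breaks down, and the dependency-parse approach becomes the better fit.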