How to parse text into sentences

I'm trying to break up a paragraph into sentences. Here is my code so far:

import java.util.*;

public class StringSplit {
    public static void main(String[] args) throws Exception {
        String testString = "The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31. Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1. That could affect economic growth and even holiday sales.";
        String[] sentences = testString.split("[\\.\\!\\?]");
        for (int i = 0; i < sentences.length; i++) {
            System.out.println(i);
            System.out.println(sentences[i]);
        }
    }
}

I found two problems:

  1. The code splits at every period ("."), even when the period does not end a sentence (for example in "W." or "Dec."). How do I prevent this?
  2. Every piece after the first starts with a space. How do I remove that leading space?
asked Dec 07 '10 05:12 by user533203



1 Answer

What you are describing is an NLP (Natural Language Processing) problem, usually called sentence boundary detection. It is fine to write a crude rule engine, but it is unlikely to scale to arbitrary English text.
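As a rough illustration of such a crude rule engine, here is a minimal sketch. It assumes that sentences end at a whitespace-delimited token finishing with ".", "!" or "?", unless that token is a known abbreviation or a single capital initial such as "W.". The class name `CrudeSentenceSplitter` and the abbreviation list are hypothetical and deliberately tiny; this is not what the Stanford tools do internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CrudeSentenceSplitter {
    // Hypothetical, incomplete abbreviation list; real text needs far more entries.
    private static final Set<String> ABBREVIATIONS = new HashSet<>(Arrays.asList(
            "Jan.", "Feb.", "Mar.", "Apr.", "Jun.", "Jul.", "Aug.",
            "Sep.", "Oct.", "Nov.", "Dec.", "Mr.", "Mrs.", "Dr."));

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Tokenizing on whitespace also removes the leading space of each sentence.
        for (String token : text.trim().split("\\s+")) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            if (endsSentence(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing text without a terminator
        }
        return sentences;
    }

    private static boolean endsSentence(String token) {
        if (!token.matches(".*[.!?]")) {
            return false; // no terminating punctuation at the end of the token
        }
        if (ABBREVIATIONS.contains(token)) {
            return false; // e.g. "Dec." in "Dec. 31"
        }
        return !token.matches("[A-Z]\\."); // e.g. the initial "W." in "George W. Bush"
    }

    public static void main(String[] args) {
        String testString = "The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31. Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1. That could affect economic growth and even holiday sales.";
        List<String> sentences = split(testString);
        for (int i = 0; i < sentences.size(); i++) {
            System.out.println(i + ": " + sentences.get(i));
        }
    }
}
```

The weakness shows up quickly: a sentence that really does end with an abbreviation ("... will rise in Dec.") is not split, and every abbreviation missing from the list causes a false split. That is the scaling problem mentioned above.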

For deeper insight and a Java library, see http://nlp.stanford.edu/software/lex-parser.shtml and http://nlp.stanford.edu:8080/parser/index.jsp , as well as a similar question for Ruby: How do you parse a paragraph of text into sentences? (perferrably in Ruby)

For example, the text:

The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31. Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1. That could affect economic growth and even holiday sales.

after tagging becomes:

The/DT outcome/NN of/IN the/DT negotiations/NNS is/VBZ vital/JJ ,/, because/IN the/DT current/JJ tax/NN levels/NNS signed/VBN into/IN law/NN by/IN President/NNP George/NNP W./NNP Bush/NNP expire/VBP on/RP Dec./NNP 31/CD ./. Unless/IN Congress/NNP acts/VBZ ,/, tax/NN rates/NNS on/IN virtually/RB all/RB Americans/NNPS who/WP pay/VBP income/NN taxes/NNS will/MD rise/VB on/IN Jan./NNP 1/CD ./. That/DT could/MD affect/VB economic/JJ growth/NN and/CC even/RB holiday/NN sales/NNS ./.

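Note how the tagger distinguishes the sentence-ending full stop, which is its own token tagged "." (./.), from the period in "Dec. 31", which stays attached to the abbreviation (Dec./NNP).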
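Once text is in that word/TAG form, recovering the sentences is mostly bookkeeping. The following is a minimal sketch that works purely on the tagged string shown above and does not call the Stanford library. It assumes tokens are separated by whitespace, that the tag follows the last "/" of each token, and that a token tagged "." closes a sentence. The class name `TaggedSentenceSplitter` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TaggedSentenceSplitter {
    static List<String> split(String tagged) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : tagged.trim().split("\\s+")) {
            // The tag follows the last '/', so a word containing '/' still parses.
            int slash = token.lastIndexOf('/');
            String word = slash < 0 ? token : token.substring(0, slash);
            String tag = slash < 0 ? "" : token.substring(slash + 1);

            // Glue punctuation tokens to the previous word instead of adding a space.
            boolean punctuation = tag.equals(".") || tag.equals(",") || tag.equals(":");
            if (current.length() > 0 && !punctuation) {
                current.append(' ');
            }
            current.append(word);

            // A token tagged "." is sentence-final punctuation.
            if (tag.equals(".")) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing words without a final "."
        }
        return sentences;
    }

    public static void main(String[] args) {
        String tagged = "That/DT could/MD affect/VB economic/JJ growth/NN and/CC even/RB holiday/NN sales/NNS ./.";
        for (String sentence : split(tagged)) {
            System.out.println(sentence);
        }
    }
}
```

Because "Dec./NNP" and "W./NNP" carry a proper-noun tag rather than ".", they never trigger a split, and since each sentence is rebuilt from tokens there is no leading space to trim.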

answered Sep 22 '22 14:09 by Favonius
