 

How can I split a text into sentences using the Stanford parser?

How can I split a text or a paragraph into sentences using the Stanford parser?

Is there any method that can extract sentences, such as the getSentencesFromString() method provided for Ruby?

Asked Feb 29 '12 by S Gaber

People also ask

How do you split text into a sentence?

Splitting textual data into sentences can be considered an easy task, where a text is split into sentences on '. ' or '\n' characters. However, in free text this pattern is not consistent: authors can break a line in the middle of a sentence or use "." in the wrong places.
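As a quick illustration of a baseline that goes beyond splitting on '. ', the JDK's built-in java.text.BreakIterator performs locale-aware sentence segmentation with no external libraries. This is a minimal sketch, independent of the Stanford tools; the class name SentenceSplit is made up for the example:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplit {
    // Splits text into sentences using the JDK's locale-aware BreakIterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("My 1st sentence. Does it work for questions? Yes!")) {
            System.out.println(s);
        }
    }
}
```

Note that BreakIterator is rule-based and will still stumble on some abbreviations, which is where a trained tool like the Stanford parser earns its keep.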

How do you separate sentences from text in Python?

A string can be split into substrings using the split(param) method. This method is part of the string object. The parameter is optional, but you can split on a specific string or character. Given a sentence, the string can be split into words.

What is sentence breaking in NLP?

Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end.


2 Answers

You can use the DocumentPreprocessor class. Below is a short snippet; there may be other ways to do what you want.

import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.SentenceUtils;
import edu.stanford.nlp.process.DocumentPreprocessor;

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
List<String> sentenceList = new ArrayList<String>();

for (List<HasWord> sentence : dp) {
    // SentenceUtils not Sentence
    String sentenceString = SentenceUtils.listToString(sentence);
    sentenceList.add(sentenceString);
}

for (String sentence : sentenceList) {
    System.out.println(sentence);
}
Answered Sep 19 '22 by 6 revs, 5 users 65%

I know there is already an accepted answer...but typically you'd just grab the SentencesAnnotation from an annotated document.

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text into the text variable
String text = ... // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
    // traversing the words in the current sentence
    // a CoreLabel is a CoreMap with additional token-specific methods
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token
        String ne = token.get(NamedEntityTagAnnotation.class);
    }
}

Source - http://nlp.stanford.edu/software/corenlp.shtml (halfway down)

And if you're only looking for sentences, you can drop the later steps such as "parse" and "dcoref" from the pipeline initialization; that will save you some load and processing time. Rock and roll. ~K
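To make that concrete, a sketch of the reduced configuration might look like this (the CoreNLP jar and models are assumed to be on the classpath; only the annotator list changes from the snippet above):

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// Only tokenization and sentence splitting — no tagging, parsing, or coref.
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation("First sentence. Second sentence.");
pipeline.annotate(document);

for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
    System.out.println(sentence.toString());
}
```

Because "tokenize, ssplit" needs no model files beyond the tokenizer, pipeline startup is much faster than with the full annotator stack.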

Answered Sep 20 '22 by Kevin