
OpenNLP Name Finder

I am following the Name Finder API example in the OpenNLP documentation. After initializing the Name Finder, the documentation uses the following code for the input text:

for (String document[][] : documents) {

  for (String[] sentence : document) {
    Span nameSpans[] = nameFinder.find(sentence);
    // do something with the names
  }

  nameFinder.clearAdaptiveData();
}

However, when I bring this into Eclipse, the 'documents' (not 'document') variable gives me an error saying that documents cannot be resolved. What is the documentation referring to with the 'documents' array variable? Do I need to initialize an array called 'documents' that holds txt files for this error to go away?

Thank you for your help.

Chris asked Apr 16 '12 19:04


1 Answer

The OpenNLP documentation states that the input text should be segmented into documents, sentences and tokens. The piece of code you provided illustrates how to deal with several documents.

If you have only one document, you don't need the outer for loop, just the inner one over the array of sentences, where each sentence is itself an array of tokens.
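To make the shape of 'documents' concrete, here is a minimal sketch with no OpenNLP dependency. It assumes 'documents' is a String[][][] (documents, then sentences, then tokens), which is what the nested loops in the question imply. The class name, the sample text, and the findCapitalized helper are hypothetical stand-ins for illustration; a real NameFinderME relies on a trained model, not on capitalization.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows the shape of the "documents" variable.
// No OpenNLP classes are used here.
final class DocumentsShapeDemo {

    // Hypothetical stand-in for a name finder: returns the indexes of
    // tokens that start with an uppercase letter.
    static List<Integer> findCapitalized(String[] sentence) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < sentence.length; i++) {
            String token = sentence[i];
            if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                hits.add(i);
            }
        }
        return hits;
    }

    // documents[d][s][t] is token t of sentence s of document d.
    static String[][][] sampleDocuments() {
        return new String[][][] {
            { // document 0: two sentences
                {"Pierre", "Vinken", "is", "61", "years", "old", "."},
                {"he", "lives", "in", "Paris", "."}
            },
            { // document 1: one sentence
                {"nothing", "to", "see", "here", "."}
            }
        };
    }

    public static void main(String[] args) {
        String[][][] documents = sampleDocuments();
        for (String[][] document : documents) {
            for (String[] sentence : document) {
                System.out.println(findCapitalized(sentence));
            }
            // With a real name finder, clearAdaptiveData() would be called here,
            // once per document.
        }
    }
}
```

With a single document, the outer dimension disappears and a String[][] of tokenized sentences is enough.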

To create an array of sentences from a document, you can use the OpenNLP SentenceDetector, and for each sentence you can use the OpenNLP Tokenizer to get its array of tokens.

Your code will look like this:

// somehow get the contents from the txt file
//      and populate a string called documentStr
// (needs java.util.Arrays and opennlp.tools.util.Span imported)

String sentences[] = sentenceDetector.sentDetect(documentStr);
for (String sentence : sentences) {
    String tokens[] = tokenizer.tokenize(sentence);
    Span nameSpans[] = nameFinder.find(tokens);
    // do something with the names
    System.out.println("Found entity: " + Arrays.toString(Span.spansToStrings(nameSpans, tokens)));
}
// the document is finished, so reset the adaptive data as in the documentation loop
nameFinder.clearAdaptiveData();
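For the "somehow get the contents from the txt file" step, one possible approach is to read the whole file into a string with java.nio.file.Files. This is only a sketch: the class name DocumentLoader, the default file name input.txt, and the UTF-8 encoding are assumptions, not part of the OpenNLP example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: loads one text file as a single document string.
final class DocumentLoader {

    // Reads the whole file into memory and decodes it as UTF-8 (assumed encoding).
    static String readDocument(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "input.txt" is a placeholder; pass the real path as the first argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "input.txt");
        String documentStr = readDocument(path);
        System.out.println(documentStr.length() + " characters read");
    }
}
```

The returned string can then be passed to sentenceDetector.sentDetect(...) as documentStr. Reading everything at once is fine for ordinary text files; very large inputs would need to be split into smaller documents first.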

You can learn how to use the SentenceDetector and the Tokenizer from the OpenNLP documentation.

wcolen answered Nov 11 '22 02:11