
Extracting multi word named entities using CoreNLP

I'm using CoreNLP for named entity extraction and have run into an issue. Whenever a named entity is composed of more than one token, such as "Han Solo", the annotator does not return "Han Solo" as a single named entity, but as two separate entities, "Han" and "Solo".

Is it possible to get the named entity as one token? I know I can use the CRFClassifier with classifyWithInlineXML for this purpose, but my solution requires that I use CoreNLP, since I need to know the word number as well.

The following is the code that I have so far:

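    import java.util.List;
    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations.*;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse");
    props.setProperty("ner.model", "edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation document = new Annotation(text);
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
            // Prints one NER tag per token, so "Han Solo" produces two PERSON lines
            System.out.println(token.get(NamedEntityTagAnnotation.class));
        }
    }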

Help me Obi-Wan Kenobi. You're my only hope.

MarkB asked Mar 09 '14 11:03



2 Answers

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.PrintWriter;
    import edu.stanford.nlp.ie.AbstractSequenceClassifier;
    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    String inputLine = "Several possible plans emerged from the talks, held at the Federal Reserve Bank of New York"
            + " and led by Timothy R. Geithner, the president of the New York Fed, and Treasury Secretary Henry M. Paulson Jr.";

    String serializedClassifier = "english.all.3class.distsim.crf.ser.gz";
    AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier);

    // try-with-resources flushes and closes the writer, even if classification throws
    try (PrintWriter writer = new PrintWriter(new File("output.xml"))) {
        writer.println("<Sentences>");
        // The "xml" format wraps every token in a <wi num="..." entity="..."> element
        String output = "<Sentence>" + classifier.classifyToString(inputLine, "xml", true) + "</Sentence>";
        writer.println(output);
        writer.println("</Sentences>");
    } catch (FileNotFoundException ex) {
        ex.printStackTrace();
    }

I was able to come up with this solution. It writes the output to an XML file, "output.xml". From that output, you can merge consecutive XML nodes that share a "PERSON", "ORGANIZATION", or "LOCATION" entity attribute into one entity. This format also includes each word's number (the num attribute) by default.

Here is an excerpt of the XML output.

    <wi num="11" entity="ORGANIZATION">Federal</wi>
    <wi num="12" entity="ORGANIZATION">Reserve</wi>
    <wi num="13" entity="ORGANIZATION">Bank</wi>
    <wi num="14" entity="ORGANIZATION">of</wi>
    <wi num="15" entity="ORGANIZATION">New</wi>
    <wi num="16" entity="ORGANIZATION">York</wi>

From the above output you can see that consecutive words are tagged "ORGANIZATION", so they can be combined into one entity.
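The merging step itself is not shown in this answer. A minimal sketch of one way to do it with the JDK's DOM parser follows. The class name `WiMerger`, the method `mergeEntities`, and the tab-separated result format are hypothetical names chosen for illustration, and the sketch assumes that tokens outside any entity carry `entity="O"`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of CoreNLP.
class WiMerger {

    /** Returns one line per merged entity: TAG, firstNum-lastNum, text (tab-separated). */
    static List<String> mergeEntities(String xml) {
        NodeList nodes;
        try {
            nodes = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagName("wi");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }

        List<String> merged = new ArrayList<>();
        String tag = "O";                 // entity tag of the run being built ("O" = none, assumed)
        String first = "", last = "";     // num attributes of the run's first and last word
        StringBuilder text = new StringBuilder();

        // One extra pass with a sentinel "O" flushes an entity that ends the input.
        for (int i = 0; i <= nodes.getLength(); i++) {
            Element wi = i < nodes.getLength() ? (Element) nodes.item(i) : null;
            String entity = wi == null ? "O" : wi.getAttribute("entity");

            if (!entity.equals(tag)) {
                // The tag changed: emit the finished run, then start a new one.
                if (!tag.equals("O")) {
                    merged.add(tag + "\t" + first + "-" + last + "\t" + text);
                }
                text.setLength(0);
                tag = entity;
                if (wi != null) {
                    first = wi.getAttribute("num");
                }
            }
            if (wi != null && !entity.equals("O")) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(wi.getTextContent());
                last = wi.getAttribute("num");
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String xml = "<Sentence>"
                + "<wi num=\"10\" entity=\"O\">the</wi>"
                + "<wi num=\"11\" entity=\"ORGANIZATION\">Federal</wi>"
                + "<wi num=\"12\" entity=\"ORGANIZATION\">Reserve</wi>"
                + "<wi num=\"13\" entity=\"ORGANIZATION\">Bank</wi>"
                + "</Sentence>";
        mergeEntities(xml).forEach(System.out::println);
    }
}
```

To run it on the real file, the contents of "output.xml" would be read into a string first. The first and last `num` values give the word range of each merged entity.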

girip11 answered Oct 11 '22 10:10


I use a temporary variable to hold the previous token's NER tag. If the current token's NER tag equals it, I combine the two words. At the end of each iteration, I assign the current NER tag to the temporary variable.
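A minimal sketch of this loop, kept free of CoreNLP so it runs on its own: `EntityMerger`, `Token`, and `Entity` are hypothetical names for illustration, and "O" is assumed to be the tag for tokens outside any entity. With CoreNLP, each `Token` could presumably be filled from a `CoreLabel` using `token.index()`, `token.word()`, and `token.get(NamedEntityTagAnnotation.class)`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
class EntityMerger {

    /** One token: its word number in the sentence, its text, and its NER tag. */
    record Token(int index, String word, String tag) {}

    /** A merged entity: its tag, joined text, and first and last word numbers. */
    record Entity(String tag, String text, int start, int end) {}

    static List<Entity> merge(List<Token> tokens) {
        List<Entity> entities = new ArrayList<>();
        String prevTag = "O";                  // the temporary variable: previous token's tag
        StringBuilder text = new StringBuilder();
        int start = -1;
        int end = -1;

        for (Token t : tokens) {
            if (!t.tag().equals("O") && t.tag().equals(prevTag)) {
                // Same tag as the previous token: extend the current entity.
                text.append(' ').append(t.word());
                end = t.index();
            } else {
                // Tag changed: emit the entity built so far, if any.
                if (!prevTag.equals("O")) {
                    entities.add(new Entity(prevTag, text.toString(), start, end));
                }
                text.setLength(0);
                if (!t.tag().equals("O")) {
                    text.append(t.word());
                    start = t.index();
                    end = t.index();
                }
            }
            prevTag = t.tag();
        }
        // Emit an entity that runs to the end of the sentence.
        if (!prevTag.equals("O")) {
            entities.add(new Entity(prevTag, text.toString(), start, end));
        }
        return entities;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
                new Token(1, "Han", "PERSON"),
                new Token(2, "Solo", "PERSON"),
                new Token(3, "flew", "O"),
                new Token(4, "to", "O"),
                new Token(5, "Cloud", "LOCATION"),
                new Token(6, "City", "LOCATION"));
        for (Entity e : merge(tokens)) {
            System.out.println(e.tag() + " [" + e.start() + "-" + e.end() + "] " + e.text());
        }
    }
}
```

One limitation of comparing tags alone: two different entities of the same type that sit directly next to each other are merged into one, because the tags carry no begin/inside marker.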

HappyCoding answered Oct 11 '22 10:10