Extracting multi word named entities using CoreNLP

Tags: stanford-nlp, named-entity-recognition

I'm using CoreNLP for named entity extraction and have run into an issue: whenever a named entity is composed of more than one token, such as "Han Solo", the annotator does not return "Han Solo" as a single named entity, but as two separate entities, "Han" and "Solo".

Is it possible to get the named entity as a single unit? I know I can use the CRFClassifier with classifyWithInlineXML for this, but my solution requires CoreNLP, since I also need to know the word number.

The following is the code that I have so far:

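    import java.util.List;
    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse");
    props.setProperty("ner.model", "edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation document = new Annotation(text);
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
            // Prints one NER tag per token, so "Han Solo" yields two lines
            System.out.println(token.get(NamedEntityTagAnnotation.class));
        }
    }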

Help me Obi-Wan Kenobi. You're my only hope.

asked Mar 09 '14 11:03

MarkB

2 Answers

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.PrintWriter;
    import edu.stanford.nlp.ie.AbstractSequenceClassifier;
    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    String inputLine = "Several possible plans emerged from the talks, held at the Federal Reserve Bank of New York"
            + " and led by Timothy R. Geithner, the president of the New York Fed, and Treasury Secretary Henry M. Paulson Jr.";

    String serializedClassifier = "english.all.3class.distsim.crf.ser.gz";
    AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier);

    // try-with-resources closes (and flushes) the writer, and avoids the
    // NullPointerException that writer.close() in a finally block would
    // throw if the file could not be opened.
    try (PrintWriter writer = new PrintWriter(new File("output.xml"))) {
        writer.println("<Sentences>");
        // "xml" output format emits one <wi> element per token
        String output = "<Sentence>" + classifier.classifyToString(inputLine, "xml", true) + "</Sentence>";
        writer.println(output);
        writer.println("</Sentences>");
    } catch (FileNotFoundException ex) {
        ex.printStackTrace();
    }

I was able to come up with this solution, which writes the classifier output to an XML file, "output.xml". From that output, you can merge consecutive XML nodes that share the same "PERSON", "ORGANIZATION", or "LOCATION" entity attribute into one entity. This format also includes the word number (the num attribute) by default.

Here is a snippet of the XML output:

    <wi num="11" entity="ORGANIZATION">Federal</wi>
    <wi num="12" entity="ORGANIZATION">Reserve</wi>
    <wi num="13" entity="ORGANIZATION">Bank</wi>
    <wi num="14" entity="ORGANIZATION">of</wi>
    <wi num="15" entity="ORGANIZATION">New</wi>
    <wi num="16" entity="ORGANIZATION">Yorkand</wi>

From the above output you can see that consecutive words are recognized as "ORGANIZATION", so these words can be combined into one entity.
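The merge step is not shown above, so here is a rough sketch of one way it might be done with the JDK's built-in DOM parser. The class name `WiXmlMerger`, the method `mergeEntities`, and the "TAG first-last: words" result format are all made up for illustration; only the `wi` element and its `num` and `entity` attributes come from the output above. It also assumes that non-entity tokens carry the entity value "O".

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: merges consecutive <wi> nodes that share an entity tag.
final class WiXmlMerger {
    static List<String> mergeEntities(String xml) {
        NodeList nodes;
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            nodes = root.getElementsByTagName("wi");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse classifier XML", e);
        }
        List<String> result = new ArrayList<>();
        String runTag = null;          // entity tag of the run being built, or null
        String first = null;           // num of the first word in the run
        String last = null;            // num of the last word in the run
        StringBuilder words = new StringBuilder();
        // One extra iteration (wi == null) flushes a run that ends at the last node.
        for (int i = 0; i <= nodes.getLength(); i++) {
            Element wi = i < nodes.getLength() ? (Element) nodes.item(i) : null;
            String tag = wi == null ? "O" : wi.getAttribute("entity");
            if (runTag != null && !runTag.equals(tag)) {
                result.add(runTag + " " + first + "-" + last + ": " + words);
                runTag = null;
                words.setLength(0);
            }
            if (wi != null && !"O".equals(tag)) {
                if (runTag == null) {
                    runTag = tag;
                    first = wi.getAttribute("num");
                } else {
                    words.append(' ');
                }
                words.append(wi.getTextContent());
                last = wi.getAttribute("num");
            }
        }
        return result;
    }
}
```

Note that this treats all `wi` nodes in the document as one sequence, so a run could in principle continue across a `Sentence` boundary; handling each `Sentence` element separately would avoid that.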

answered Oct 11 '22 10:10

girip11

I use a temporary variable to hold the previous token's NER tag. If the current token's NER tag is equal to it, I append the current word to the entity being built. At the end of each iteration, the temporary variable is set to the current NER tag.
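A minimal sketch of that idea, written without a CoreNLP dependency so it can be run on its own. `TaggedToken`, `EntitySpan`, and `NerRunMerger` are hypothetical names invented for this example; in a real pipeline the word, tag, and position would presumably come from each `CoreLabel` (for example `token.word()`, `token.get(NamedEntityTagAnnotation.class)`, and `token.index()`). It assumes non-entity tokens are tagged "O".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tagged token: word, NER tag, position in the sentence.
final class TaggedToken {
    final String word;
    final String tag;
    final int index;
    TaggedToken(String word, String tag, int index) {
        this.word = word;
        this.tag = tag;
        this.index = index;
    }
}

// A merged entity with the positions of its first and last token.
final class EntitySpan {
    final String text;
    final String tag;
    final int start;
    final int end;
    EntitySpan(String text, String tag, int start, int end) {
        this.text = text;
        this.tag = tag;
        this.start = start;
        this.end = end;
    }
}

final class NerRunMerger {
    static List<EntitySpan> merge(List<TaggedToken> tokens) {
        List<EntitySpan> spans = new ArrayList<>();
        String prevTag = "O";                       // the "temp" variable from the answer
        StringBuilder text = new StringBuilder();   // words of the entity being built
        int start = -1;
        int end = -1;
        for (TaggedToken t : tokens) {
            boolean continuesRun = !"O".equals(t.tag) && t.tag.equals(prevTag);
            if (continuesRun) {
                text.append(' ').append(t.word);
                end = t.index;
            } else {
                // Tag changed: emit the finished entity, if any, then maybe start a new one.
                if (text.length() > 0) {
                    spans.add(new EntitySpan(text.toString(), prevTag, start, end));
                }
                text.setLength(0);
                if (!"O".equals(t.tag)) {
                    text.append(t.word);
                    start = t.index;
                    end = t.index;
                }
            }
            prevTag = t.tag;
        }
        if (text.length() > 0) {
            spans.add(new EntitySpan(text.toString(), prevTag, start, end));
        }
        return spans;
    }
}
```

One limitation of comparing only against the previous tag: two different entities of the same type that sit directly next to each other are merged into one span.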

answered Oct 11 '22 10:10

HappyCoding
