I'm using CoreNLP for named entity extraction and have run into an issue: whenever a named entity consists of more than one token, such as "Han Solo", the annotator returns it not as a single named entity but as two separate entities, "Han" and "Solo".
Is it possible to get the named entity as a single unit? I know I can use the CRFClassifier with classifyWithInlineXML to this end, but my solution requires CoreNLP, since I also need to know the word number.
The following is the code that I have so far:
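import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse");
props.setProperty("ner.model", "edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text); // text is the input string
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // Prints one NER tag per token, so "Han Solo" produces two PERSON lines
        System.out.println(token.get(NamedEntityTagAnnotation.class));
    }
}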
Help me Obi-Wan Kenobi. You're my only hope.
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

String inputLine = "Several possible plans emerged from the talks, held at the Federal Reserve Bank of New York" + " and led by Timothy R. Geithner, the president of the New York Fed, and Treasury Secretary Henry M. Paulson Jr.";
String serializedClassifier = "english.all.3class.distsim.crf.ser.gz";
AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier);
// try-with-resources flushes and closes the writer, even if writing fails
try (PrintWriter writer = new PrintWriter(new File("output.xml"))) {
    writer.println("<Sentences>");
    // The "xml" output format wraps each token in a <wi> element with its number and entity label
    String output = "<Sentence>" + classifier.classifyToString(inputLine, "xml", true) + "</Sentence>";
    writer.println(output);
    writer.println("</Sentences>");
} catch (FileNotFoundException ex) {
    ex.printStackTrace();
}
I was able to come up with this solution, which writes the output to an XML file, "output.xml". In that output, you can merge consecutive nodes that share the same "PERSON", "ORGANIZATION", or "LOCATION" entity attribute into one entity. This format also includes the word number by default.
Here is a snippet of the XML output:
<wi num="11" entity="ORGANIZATION">Federal</wi>
<wi num="12" entity="ORGANIZATION">Reserve</wi>
<wi num="13" entity="ORGANIZATION">Bank</wi>
<wi num="14" entity="ORGANIZATION">of</wi>
<wi num="15" entity="ORGANIZATION">New</wi>
<wi num="16" entity="ORGANIZATION">York</wi>
From the above output you can see that consecutive words are recognized as "ORGANIZATION", so these words can be combined into one entity.
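For what it's worth, the merging step over the XML could look roughly like the sketch below. It uses only the JDK's DOM parser and assumes the `<wi>` elements sit under a single root element (as with the `<Sentence>` wrapper above) and that non-entity words carry the label "O". The class name `WiMerger`, the method `mergeEntities`, and the "LABEL[start-end]: text" result format are made up for illustration and are not part of CoreNLP.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a CoreNLP class.
class WiMerger {
    /** Merges adjacent <wi> elements that share a non-"O" entity label. */
    static List<String> mergeEntities(String xml) {
        NodeList nodes;
        try {
            nodes = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagName("wi");
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<String> merged = new ArrayList<>();
        String label = "O";               // label of the run currently being built
        StringBuilder text = new StringBuilder();
        int start = -1;
        int end = -1;
        for (int i = 0; i < nodes.getLength(); i++) {
            Element wi = (Element) nodes.item(i);
            String entity = wi.getAttribute("entity");
            int num = Integer.parseInt(wi.getAttribute("num"));
            if (!entity.equals("O") && entity.equals(label)) {
                // Same entity label as the previous word: extend the current run
                text.append(' ').append(wi.getTextContent());
                end = num;
                continue;
            }
            // Label changed: emit the finished run (if it was an entity) and start a new one
            if (!label.equals("O")) {
                merged.add(label + "[" + start + "-" + end + "]: " + text);
            }
            label = entity;
            text = new StringBuilder(wi.getTextContent());
            start = num;
            end = num;
        }
        if (!label.equals("O")) {
            merged.add(label + "[" + start + "-" + end + "]: " + text);
        }
        return merged;
    }
}
```

Each result keeps the first and last word number of the merged run, so the position information from the `num` attribute is not lost.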
I use a temporary variable to hold the previous NER tag. If the current token's NER tag equals it, I combine the two words. At the end of each iteration, the temporary variable is set to the current NER tag.
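A minimal sketch of that loop, written against plain lists so it runs without CoreNLP on the classpath. In the pipeline from the question, the words and tags would presumably come from `token.word()` and `token.get(NamedEntityTagAnnotation.class)`. The class name `NerTagMerger`, the method `merge`, and the "TAG@index: text" result format are invented for this example, and the index is the zero-based position of the entity's first token in the list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a CoreNLP class.
class NerTagMerger {
    /** Joins consecutive words whose NER tag is identical and not "O". */
    static List<String> merge(List<String> words, List<String> tags) {
        List<String> entities = new ArrayList<>();
        String previousTag = "O";          // the "temp" variable holding the previous tag
        StringBuilder current = new StringBuilder();
        int startIndex = -1;
        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals("O") && tag.equals(previousTag)) {
                // Same tag as the previous token: append to the entity being built
                current.append(' ').append(words.get(i));
            } else {
                // Tag changed: store the finished entity, then start over at this token
                if (!previousTag.equals("O")) {
                    entities.add(previousTag + "@" + startIndex + ": " + current);
                }
                current = new StringBuilder(words.get(i));
                startIndex = i;
            }
            previousTag = tag;
        }
        if (!previousTag.equals("O")) {
            entities.add(previousTag + "@" + startIndex + ": " + current);
        }
        return entities;
    }
}
```

One limitation of comparing tags alone: two different entities of the same type that appear directly next to each other are merged into one.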