
How to create a good NER training model in OpenNLP?

I have just started with OpenNLP. I need to create a simple training model to recognize named entities.

Reading the documentation at https://opennlp.apache.org/docs/1.8.0/apidocs/opennlp-tools/opennlp/tools/namefind I see this sample text for training the model:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
<START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC ,
    was named a director of this British industrial conglomerate .

I have two questions:

  • Why do I have to put the persons' names in the context of a sentence? Why not write one name per line, like this:

    <START:person> Robert <END>
    
    <START:person> Maria <END>
    
    <START:person> John <END>
    
  • How can I attach extra information to a name? For example, I would like to store Male/Female for each name.

(I know there are systems that try to infer it from the last letter, such as "a" for female, but I would like to add it myself.)

Thanks.

Dail asked Aug 14 '15 13:08


1 Answer

The answer to your first question is that the algorithm works on the surrounding context (tokens) within a sentence; it is not a simple lookup mechanism. OpenNLP uses maximum entropy, a form of multinomial logistic regression, to build its model. The point is to reduce "word sense ambiguity" and find entities in context. For instance, if my name is April, I can easily be confused with the month of April, and if my name is May, I can be confused with the month of May as well as the verb "may."

As for the second part of your first question: you could make a list of known names and use it in a program that scans your sentences and automatically annotates them, which helps you create a training set. A list of names alone, without context, will not train the model sufficiently, if at all. In fact, there is an OpenNLP addon called the "modelbuilder addon" designed for this: you give it a file of names, and it uses the names and some of your data (sentences) to train a model. If you are looking for particular names of generally unambiguous entities, you may be better off using a list and something like a regex to find the names rather than NER.
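To make the "annotate your sentences from a list of names" idea concrete, here is a minimal sketch in plain Python. It is not the modelbuilder addon and does not use any OpenNLP API; the function name `annotate` and its parameters are made up for illustration. It assumes the sentences are already whitespace-tokenized, as in the OpenNLP training format, and it simply wraps exact token matches in `<START:person> ... <END>` tags.

```python
def annotate(sentence, known_names, entity_type="person"):
    """Wrap known names in a whitespace-tokenized sentence in
    <START:type> ... <END> tags (hypothetical helper, not an OpenNLP API)."""
    tokens = sentence.split()
    # Split each known name into tokens and drop empty entries.
    # Longer names are tried first so "Pierre Vinken" wins over "Vinken".
    names = sorted(
        (name.split() for name in known_names if name.split()),
        key=len,
        reverse=True,
    )
    out = []
    i = 0
    while i < len(tokens):
        for name in names:
            if tokens[i:i + len(name)] == name:
                out.append(f"<START:{entity_type}>")
                out.extend(name)
                out.append("<END>")
                i += len(name)
                break
        else:
            # No known name starts at this position; keep the token as-is.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    names = ["Pierre Vinken", "Vinken", "Rudolph Agnew"]
    print(annotate("Mr . Vinken is chairman of Elsevier N.V. , the Dutch publishing group .", names))
```

Because this is a pure lookup, it will happily tag the month "April" as a person if "April" is in the list, so the output should be reviewed before it is used as training data.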

As for your second question, there are a few options. In general, I don't think NER is a great tool for distinguishing something like gender, although with enough training sentences you may get decent results. NER uses a model based on the tokens surrounding an entity in your training sentences to establish that a named entity exists, so it can't do much to identify gender. You may be better off finding all person names and then checking them against an index of names that you know are male or female. Also, some names, like Pat, are both male and female, and in most text there will be no indication of which it is, to either human or machine.

That being said, you could create separate male and female models, or you could create different entity types within the same model. I've never tried this, but it might work OK; you'd have to test it on your data. The annotation would look like this (with the entity type names male.person and female.person):

<START:male.person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mrs . <START:female.person> Maria <END> is chairman of Elsevier N.V. , the Dutch publishing group .
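For the lookup alternative mentioned earlier (find person names first, then check them against an index of names with a known gender), a rough sketch in plain Python could look like the following. Everything here is an assumption for illustration: `GENDER_INDEX`, `gender_of`, and `tag_genders` are made-up names, the index contents are sample data, and the detected names are read from text tagged in the `<START:person> ... <END>` format rather than from a real name finder.

```python
import re

# Hypothetical index of first names with a known gender. Ambiguous names
# such as "Pat" are left out on purpose so they resolve to "unknown".
GENDER_INDEX = {"Pierre": "male", "Rudolph": "male", "Maria": "female"}

# Matches one person span in the whitespace-tokenized annotation format.
PERSON_SPAN = re.compile(r"<START:person>\s+(.*?)\s+<END>")


def gender_of(person_name, index=GENDER_INDEX):
    """Look up the first token of a person name in the gender index."""
    tokens = person_name.split()
    if not tokens:
        return "unknown"
    return index.get(tokens[0], "unknown")


def tag_genders(annotated_sentence):
    """Return (name, gender) pairs for every person span in the sentence."""
    return [
        (name, gender_of(name))
        for name in PERSON_SPAN.findall(annotated_sentence)
    ]


if __name__ == "__main__":
    print(tag_genders("<START:person> Pierre Vinken <END> , 61 years old , will join the board ."))
    print(tag_genders("<START:person> Pat <END> called yesterday ."))
```

This keeps the NER model responsible only for finding persons and leaves gender to data you control, which also makes it easy to report "unknown" for names the index does not cover.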

NER = Named Entity Recognition

HTH

Mark Giaconia answered Oct 19 '22 17:10
