
OpenNLP: foreign names do not get recognized

Tags:

nlp

opennlp

I just started using OpenNLP to recognize names. I am using the model (en-ner-person.bin) that comes with OpenNLP. I noticed that while it recognizes US, UK, and European names, it fails to recognize Indian or Japanese names. My questions are: (1) Are there already models available that I can use to recognize foreign names? (2) If not, I believe I will need to generate new models. In that case, is there a corpus available that I can use?

Shirish Kumar asked Dec 11 '13 02:12


1 Answer

You can build your own model from your own data using an OpenNLP addon called modelbuilder-addon. It's brand new, so if you try it you may be the first one to do so other than me, but it works for me.

You feed it the following:

  • a list of "known entities" via a file where each line is a name
  • a list of sentences from YOUR data via a file where each line is a sentence
  • (optionally) a blacklist to remove false positives
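As a rough illustration of the inputs listed above, the sketch below writes three small sample files. The file names, the sample names and sentences, and the one-entry-per-line layout of the blacklist are assumptions made for this example; the addon itself is not called here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ModelBuilderInputs {

  /** Writes one entry per line, UTF-8 encoded, and returns the file path. */
  public static Path writeLines(Path dir, String fileName, List<String> lines) throws IOException {
    return Files.write(dir.resolve(fileName), lines, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("modelbuilder-input");
    // Known entities: one name per line (hypothetical sample names).
    Path names = writeLines(dir, "names.txt", List.of("Priya Sharma", "Kenji Tanaka"));
    // Sentences from your own data: one sentence per line.
    Path sentences = writeLines(dir, "sentences.txt", List.of(
        "Priya Sharma joined the Mumbai office in 2012 .",
        "The report was reviewed by Kenji Tanaka last week ."));
    // Optional blacklist; assumed here to be one entry per line as well.
    Path blacklist = writeLines(dir, "blacklist.txt", List.of("Mumbai"));
    System.out.println(names + "\n" + sentences + "\n" + blacklist);
  }
}
```

The resulting paths can then be passed as the sentence, name, and blacklist files when you run the addon.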

You can check out the addon here:

https://svn.apache.org/repos/asf/opennlp/addons/modelbuilder-addon

You can use this to get started:

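import java.io.File;
import opennlp.addons.modelbuilder.DefaultModelBuilderUtil;

public class ModelBuilderAddonUse {

  public static void main(String[] args) {
    // Input: one sentence per line, taken from your own data.
    File fileOfSentences = new File("path to your sentence file");
    // Input: one known person name per line.
    File fileOfNames = new File("path to your file of person names");
    // Input: entries to exclude as false positives.
    File blackListFile = new File("path to your blacklist file");
    // Output: where the trained model will be saved.
    File modelOutFile = new File("path where the model will be saved");
    // Output: where the annotated sentences will be written (not the input sentence file).
    File annotatedSentencesOutFile = new File("path where the annotated sentences will be saved");

    // "person" is the entity type; 3 is the number of iterations.
    DefaultModelBuilderUtil.generateModel(fileOfSentences, fileOfNames, blackListFile, modelOutFile, annotatedSentencesOutFile, "person", 3);
  }
}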

The idea is that your known entities (common names in your data) are used to create annotations, those annotations are used to train a model, and the model is then used to find more names and produce more annotations, and so on. The tool repeats this cycle as many times as the "iterations" parameter specifies. Run it and check your results; add any undesirable hits to the blacklist file, then run the training again. I've used this and got pretty good results. If you find problems with it, file a ticket with OpenNLP.
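To make the first step of that cycle concrete, here is a minimal sketch of seeding annotations from a list of known names. It is not the addon's actual implementation: the class and method names are hypothetical, matching is done on exact whitespace-separated tokens, and the sample names are invented. The output uses the `<START:type> ... <END>` markup that OpenNLP's name finder training data uses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class SeedAnnotator {

  /** Wraps every known, non-blacklisted name in name-finder training tags. */
  public static String annotate(String sentence, List<String> knownNames,
                                Set<String> blacklist, String type) {
    String[] tokens = sentence.trim().split("\\s+");
    StringBuilder out = new StringBuilder();
    int i = 0;
    while (i < tokens.length) {
      int matched = 0; // tokens covered by the longest known name starting at i
      for (String name : knownNames) {
        if (blacklist.contains(name)) continue; // skip known false positives
        String[] nameTokens = name.trim().split("\\s+");
        if (nameTokens.length > matched && startsAt(tokens, i, nameTokens)) {
          matched = nameTokens.length;
        }
      }
      if (out.length() > 0) out.append(' ');
      if (matched > 0) {
        out.append("<START:").append(type).append("> ")
           .append(String.join(" ", Arrays.copyOfRange(tokens, i, i + matched)))
           .append(" <END>");
        i += matched;
      } else {
        out.append(tokens[i]);
        i++;
      }
    }
    return out.toString();
  }

  /** True if nameTokens occurs in tokens beginning at index start. */
  private static boolean startsAt(String[] tokens, int start, String[] nameTokens) {
    if (start + nameTokens.length > tokens.length) return false;
    for (int k = 0; k < nameTokens.length; k++) {
      if (!tokens[start + k].equals(nameTokens[k])) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    List<String> names = List.of("Priya Sharma", "Kenji Tanaka", "Will");
    Set<String> blacklist = Set.of("Will"); // an undesirable hit found in an earlier run
    System.out.println(annotate("Will Priya Sharma meet Kenji Tanaka in Tokyo ?",
        names, blacklist, "person"));
  }
}
```

In the real tool, sentences annotated this way would be used to train a model, and the names that model finds would feed the next iteration; the blacklist keeps entries such as "Will" from being tagged again.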

Mark Giaconia answered Sep 28 '22 01:09