
How to "update" an existing Named Entity Recognition model - rather than creating from scratch?

I am following the tutorial steps for OpenNLP Named Entity Recognition and using the pre-trained "en-ner-person.bin" model. The tutorial has instructions for training and creating a new model. Is there any way to "update" the existing "en-ner-person.bin" with additional training data instead?

Say I have a list of 500 additional person names that are otherwise not recognized as persons - how do I generate a new model?

asked Feb 07 '14 01:02 by sky



1 Answer

Sorry it took me a while to put together a decent code example. The code below reads in your sentences and runs the default en-ner-person model over them to do its best. It then writes the results to two files: one for the good hits (the known entities) and one for the bad hits (the blacklist). Those files, along with the sentences, are then fed into the "modelbuilder-addon" call at the bottom.

To get the best results, run the class as is, then go into the known entities file and the blacklist file and edit them by hand. Put names that the model did not find at all, but that you are aware of, into the knowns, and remove bad names from the knowns. Move good names out of the blacklist file and into the knowns file. Then run the model builder part again, without the first part that reads in all your data. It's OK to have duplicates in the knowns and blacklist files. If you have questions let me know; it's a bit complicated.
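The manual editing step can also be scripted if you already have your extra names in a list. The following is only a minimal sketch that uses plain JDK file APIs and nothing from OpenNLP. The class name EntityListCurator and its methods are hypothetical, and it assumes both files hold one entity per line, which is how the code below writes them.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP or the modelbuilder-addon.
final class EntityListCurator {

  // Reads one entry per line, trimming whitespace and skipping blank lines.
  // A missing file is treated as an empty list.
  static Set<String> readEntries(Path file) throws IOException {
    Set<String> entries = new LinkedHashSet<>();
    if (Files.exists(file)) {
      for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
        String trimmed = line.trim();
        if (!trimmed.isEmpty()) {
          entries.add(trimmed);
        }
      }
    }
    return entries;
  }

  // Adds every name in extraNames to the knowns file and removes it from
  // the blacklist file. Both files are rewritten without duplicates.
  static void promoteToKnowns(Path knowns, Path blacklist, Collection<String> extraNames)
      throws IOException {
    Set<String> known = readEntries(knowns);
    Set<String> black = readEntries(blacklist);
    for (String name : extraNames) {
      String trimmed = name.trim();
      if (!trimmed.isEmpty()) {
        known.add(trimmed);
        black.remove(trimmed);
      }
    }
    Files.write(knowns, known, StandardCharsets.UTF_8);
    Files.write(blacklist, black, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 3) {
      System.err.println("usage: EntityListCurator <knownsFile> <blacklistFile> <extraNamesFile>");
      return;
    }
    Path extra = Paths.get(args[2]);
    promoteToKnowns(Paths.get(args[0]), Paths.get(args[1]), readEntries(extra));
  }
}
```

For the 500 names in the question, you would put them in a text file (one per line) and pass it as the third argument, then rerun only the model builder call. The sketch matches names by exact string, so variants in spelling or casing are treated as different entries.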

import java.io.File;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import opennlp.addons.modelbuilder.DefaultModelBuilderUtil;
import opennlp.tools.entitylinker.EntityLinkerProperties;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class ModelBuilderAddonUse {
//fill this method in with however you are going to get your data into a list of sentences..for me I am hitting a MySQL database
  private static List<String> getSentencesFromSomewhere() throws Exception {
    List<String> sentences = new ArrayList<>();
    int counter = 0;
    DocProvider dp = new DocProvider();
    String modelPath = "c:\\apache\\entitylinker\\";
    EntityLinkerProperties properties = new EntityLinkerProperties(new File(modelPath + "entitylinker.properties"));
    Map<Long, List<String>> docs = dp.getDocs(properties);
    for (Long key : docs.keySet()) {
      System.out.println("\t\tDOC: " + key + "\n\n");
      sentences.addAll(docs.get(key));
      //stop after 1000 documents (counter is incremented once per document)
      counter++;
      if (counter > 1000) {
        break;
      }
    }
    return sentences;
  }

  public static void main(String[] args) throws Exception {
    /**
     * establish a file to put sentences in
     */
    File sentences = new File("C:\\temp\\modelbuilder\\sentences.text");

    /**
     * establish a file to put your NER hits in (the ones you want to keep based
     * on prob)
     */
    File knownEntities = new File("C:\\temp\\modelbuilder\\knownentities.txt");

    /**
     * establish a BLACKLIST file to put your bad NER hits in (also can be based
     * on prob)
     */
    File blacklistedentities = new File("C:\\temp\\modelbuilder\\blentities.txt");

    /**
     * establish a file to write your annotated sentences to
     */
    File annotatedSentences = new File("C:\\temp\\modelbuilder\\annotatedSentences.txt");

    /**
     * establish a file to write your model to
     */
    File theModel = new File("C:\\temp\\modelbuilder\\theModel");


//------------create a bunch of file writers to write your results and sentences to a file

    FileWriter sentenceWriter = new FileWriter(sentences, true);
    FileWriter blacklistWriter = new FileWriter(blacklistedentities, true);
    FileWriter knownEntityWriter = new FileWriter(knownEntities, true);

//set some thresholds to decide where to write hits, you don't have to use these at all...
    double keeperThresh = .95;
    double blacklistThresh = .7;


    /**
     * Load your model as normal
     */
    TokenNameFinderModel personModel = new TokenNameFinderModel(new File("c:\\temp\\opennlpmodels\\en-ner-person.zip"));
    NameFinderME personFinder = new NameFinderME(personModel);
    /**
     * do your normal NER on the sentences you have
     */
    for (String s : getSentencesFromSomewhere()) {
      sentenceWriter.write(s.trim() + "\n");
      sentenceWriter.flush();

      String[] tokens = s.split(" ");//better to use a tokenizer really
      Span[] find = personFinder.find(tokens);
      double[] probs = personFinder.probs();
      String[] names = Span.spansToStrings(find, tokens);
      for (int i = 0; i < names.length; i++) {
        //YOU PROBABLY HAVE BETTER HEURISTICS THAN THIS TO MAKE SURE YOU GET GOOD HITS OUT OF THE DEFAULT MODEL
        if (probs[i] > keeperThresh) {
          knownEntityWriter.write(names[i].trim() + "\n");
        }
        if (probs[i] < blacklistThresh) {
          blacklistWriter.write(names[i].trim() + "\n");
        }
      }
      personFinder.clearAdaptiveData();
      blacklistWriter.flush();
      knownEntityWriter.flush();
    }
    //flush and close all the writers
    knownEntityWriter.flush();
    knownEntityWriter.close();
    sentenceWriter.flush();
    sentenceWriter.close();
    blacklistWriter.flush();
    blacklistWriter.close();

    /**
     * THIS IS WHERE THE ADDON IS GOING TO USE THE FILES (AS IS) TO CREATE A NEW MODEL. YOU SHOULD NOT HAVE TO RUN THE FIRST PART AGAIN AFTER THIS RUNS, JUST NOW PLAY WITH THE
     * KNOWN ENTITIES AND BLACKLIST FILES AND RUN THE METHOD BELOW AGAIN UNTIL YOU GET SOME DECENT RESULTS (A DECENT MODEL OUT OF IT).
     */
    DefaultModelBuilderUtil.generateModel(sentences, knownEntities, blacklistedentities,
            theModel, annotatedSentences, "person", 3);


  }
}
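The threshold logic in the loop above is the part you are most likely to tune, so it can help to pull it out and try it on its own. This is just a sketch under my own assumptions: the class HitRouter and the REVIEW bucket are hypothetical additions (the code above simply skips hits that fall between the two thresholds), and the comparisons are strict, the same as in the loop.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper that mirrors the two threshold checks in the main loop.
final class HitRouter {

  enum Bucket { KNOWN, BLACKLIST, REVIEW }

  // Strictly above keeperThresh goes to the knowns, strictly below
  // blacklistThresh goes to the blacklist, everything else is left for review.
  static Bucket route(double prob, double keeperThresh, double blacklistThresh) {
    if (prob > keeperThresh) {
      return Bucket.KNOWN;
    }
    if (prob < blacklistThresh) {
      return Bucket.BLACKLIST;
    }
    return Bucket.REVIEW;
  }

  // Counts how many probabilities land in each bucket for a pair of thresholds.
  static Map<Bucket, Integer> tally(double[] probs, double keeperThresh, double blacklistThresh) {
    Map<Bucket, Integer> counts = new EnumMap<>(Bucket.class);
    for (Bucket bucket : Bucket.values()) {
      counts.put(bucket, 0);
    }
    for (double prob : probs) {
      Bucket bucket = route(prob, keeperThresh, blacklistThresh);
      counts.put(bucket, counts.get(bucket) + 1);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Made-up probabilities, only to show the call.
    double[] sample = {0.99, 0.96, 0.80, 0.65, 0.40};
    System.out.println(tally(sample, 0.95, 0.7));
  }
}
```

Running tally over the probabilities from a batch of your own sentences shows how many hits each pair of thresholds would send to each file before you commit to writing them.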

This is what the console should look like (I removed some lines for brevity here):

ITERATION: 0
    Perfoming Known Entity Annotation
        knowns: 625
        reading data....: 
        writing annotated sentences....: 
        building model.... 
    Building Model using 7343 annotations
        reading training data...
Indexing events using cutoff of 5

    Computing event counts...  done. 561755 events
    Indexing...  done.
Sorting and merging events... done. Reduced 561755 events to 127362.
Done indexing.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 127362
        Number of Outcomes: 3
      Number of Predicates: 106490
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-617150.9462211537  0.015709695507828147
  2:  ... loglikelihood=-90520.86903515142  0.9771288195031642
  3:  ... loglikelihood=-56901.86905339755  0.9771288195031642
  4:  ... loglikelihood=-44231.80460317638  0.9773086131854634
  5:  ... loglikelihood=-37222.56576767385  0.9787985865724381
  6:  ... loglikelihood=-32900.5623814595   0.9801924326441243
  7:  ... loglikelihood=-29992.881445391187 0.9829747843810914
  8:  ... loglikelihood=-27893.341149419102 0.9836423351817073
  9:  ... loglikelihood=-26296.107313900917 0.9845092611547739
 10:  ... loglikelihood=-25033.501573153182 0.9850682236918229
 11:  ... loglikelihood=-24006.060636903556 0.9856182855515305
 12:  ... loglikelihood=-23150.856525607975 0.9859084476328649
 13:  ... loglikelihood=-22425.987337392176 0.9861897090368577
 14:  ... loglikelihood=-21802.386362016423 0.9864211266477378
 15:  ... loglikelihood=-21259.20580401235  0.9865208142339632
 16:  ... loglikelihood=-20781.0716762281   0.9867362106256287
 17:  ... loglikelihood=-20356.37732369309  0.986905323495118
 18:  ... loglikelihood=-19976.18228587008  0.9870673158227341
 19:  ... loglikelihood=-19633.47877575036  0.9872097266601988
 20:  ... loglikelihood=-19322.689448146353 0.9873165347882974
 21:  ... loglikelihood=-19039.31522510173  0.9874073216971812
 22:  ... loglikelihood=-18779.683112448918 0.9875176900962164
 23:  ... loglikelihood=-18540.76222439295  0.9876316187661881
 24:  ... loglikelihood=-18320.027315327916 0.9877081645913254
 25:  ... loglikelihood=-18115.35602743375  0.9877918309583359
 26:  ... loglikelihood=-17924.95047403401  0.9878612562416
 27:  ... loglikelihood=-17747.27665623459  0.9879378020667373
 28:  ... loglikelihood=-17581.01712643139  0.9879947664017231
 29:  ... loglikelihood=-17425.03361369085  0.9880784327687337
 30:  ... loglikelihood=-17278.3372262906   0.9881282765618463
 31:  ... loglikelihood=-17140.06447937828  0.9882012621160471
 32:  ... loglikelihood=-17009.45784626013  0.9882546661800963
 33:  ... loglikelihood=-16885.84985637711  0.9883187510569554
 34:  ... loglikelihood=-16768.64999916476  0.9883703749855364
 35:  ... loglikelihood=-16657.3338665414   0.9884166585077124
 36:  ... loglikelihood=-16551.434095577726 0.9884558214880153
 37:  ... loglikelihood=-16450.532769374073 0.9885074454165962
 38:  ... loglikelihood=-16354.255007222264 0.9885448282614306
 39:  ... loglikelihood=-16262.263530858221 0.9885733104289236
 40:  ... loglikelihood=-16174.254036589966 0.9886391754412511
 41:  ... loglikelihood=-16089.951236435176 0.9886765582860856
 42:  ... loglikelihood=-16009.105457548561 0.9887281822146665
 43:  ... loglikelihood=-15931.489709807445 0.988747763704818
 44:  ... loglikelihood=-15856.897147780543 0.9887798061432475
 45:  ... loglikelihood=-15785.138866385483 0.9888065081752722
 46:  ... loglikelihood=-15716.041980029182 0.9888349903427651
 47:  ... loglikelihood=-15649.447943527766 0.9888581321038531
 48:  ... loglikelihood=-15585.211079986258 0.9888901745422827
 49:  ... loglikelihood=-15523.19728647256  0.9889328977935221
 50:  ... loglikelihood=-15463.282892914636 0.9889595998255467
 51:  ... loglikelihood=-15405.353653492159 0.9889685005028883
 52:  ... loglikelihood=-15349.303852923775 0.9889809614511664
 53:  ... loglikelihood=-15295.035512678789 0.9889934223994445
 54:  ... loglikelihood=-15242.457684348112 0.989013003889596
 55:  ... loglikelihood=-15191.485819217298 0.9890236847024059
 56:  ... loglikelihood=-15142.041204645499 0.9890397059216206
 57:  ... loglikelihood=-15094.050459152337 0.9890539470053671
 58:  ... loglikelihood=-15047.445079207273 0.9890592874117721
 59:  ... loglikelihood=-15002.161031666768 0.9890753086309868
 60:  ... loglikelihood=-14958.13838658306  0.9890966702566065
 61:  ... loglikelihood=-14915.320985817205 0.9891180318822262
 62:  ... loglikelihood=-14873.656143433394 0.9891269325595677
 63:  ... loglikelihood=-14833.094374397517 0.9891500743206558
 64:  ... loglikelihood=-14793.589148498404 0.9891589749979973
 65:  ... loglikelihood=-14755.096666806796 0.9891785564881488
 66:  ... loglikelihood=-14717.5756582924   0.9891892373009586
 67:  ... loglikelihood=-14680.98719451864  0.9891892373009586
 68:  ... loglikelihood=-14645.294520562966 0.9891945777073635
 69:  ... loglikelihood=-14610.462900520715 0.9891999181137685
 70:  ... loglikelihood=-14576.45947616036  0.989214159197515
 71:  ... loglikelihood=-14543.25313742511  0.9892212797393881
 72:  ... loglikelihood=-14510.814403643026 0.9892230598748565
 73:  ... loglikelihood=-14479.115314429962 0.9892230598748565
 74:  ... loglikelihood=-14448.129329357815 0.9892426413650078
 75:  ... loglikelihood=-14417.831235594616 0.9892515420423494
 76:  ... loglikelihood=-14388.19706276905  0.9892622228551593
 77:  ... loglikelihood=-14359.204004414    0.9892711235325008
 78:  ... loglikelihood=-14330.8303454032   0.9892764639389058
 79:  ... loglikelihood=-14303.055394843146 0.9892764639389058
 80:  ... loglikelihood=-14275.859423957678 0.9892924851581205
 81:  ... loglikelihood=-14249.223608524193 0.9893013858354621
 82:  ... loglikelihood=-14223.129975482772 0.9893209673256135
 83:  ... loglikelihood=-14197.561353359844 0.9893263077320185
 84:  ... loglikelihood=-14172.50132620183  0.9893280878674867
 85:  ... loglikelihood=-14147.934190713178 0.9893263077320185
 86:  ... loglikelihood=-14123.84491635766  0.9893316481384233
 87:  ... loglikelihood=-14100.21910816809  0.9894313357246487
 88:  ... loglikelihood=-14077.042972066316 0.989433115860117
 89:  ... loglikelihood=-14054.303282478262 0.9894437966729268
 90:  ... loglikelihood=-14031.987352086799 0.9894580377566733
 91:  ... loglikelihood=-14010.083003539214 0.9894615980276099
 92:  ... loglikelihood=-13988.578542971209 0.9894776192468246
 93:  ... loglikelihood=-13967.46273521311  0.9894811795177613
 94:  ... loglikelihood=-13946.724780546094 0.9894829596532296
 95:  ... loglikelihood=-13926.354292898612 0.9894829596532296
 96:  ... loglikelihood=-13906.341279379953 0.9894900801951029
 97:  ... loglikelihood=-13886.676121050288 0.9894936404660395
 98:  ... loglikelihood=-13867.34955484593  0.9894954206015077
 99:  ... loglikelihood=-13848.35265657199  0.9894954206015077
100:  ... loglikelihood=-13829.676824889664 0.9894972007369761
    model generated
        model building complete.... 
        annotated sentences: 7343
    Performing NER with new model
        Printing NER Results. Add undesired results to the blacklist file and start over

//prints some names

    annotated sentences: 7369
        knowns: 651
ITERATION: 1
    Perfoming Known Entity Annotation
        knowns: 651
        reading data....: 
        writing annotated sentences....: 
        building model.... 
    Building Model using 20370 annotations
        reading training data...
Indexing events using cutoff of 5

    Computing event counts...  done. 1116781 events
    Indexing...  done.
Sorting and merging events... done. Reduced 1116781 events to 288251.
Done indexing.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 288251
        Number of Outcomes: 3
      Number of Predicates: 206399
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-1226909.3303549637 0.03418485808766446
  2:  ... loglikelihood=-196688.7107544095  0.9622047653031346
  3:  ... loglikelihood=-138615.22912914792 0.9651462551744702
  4:  ... loglikelihood=-114777.09879832959 0.9697075791941303
  5:  ... loglikelihood=-101055.0229949508  0.9716443958126079
  6:  ... loglikelihood=-92253.8923255943   0.973049326591337
  7:  ... loglikelihood=-86146.35307405592  0.9750121107003074
  8:  ... loglikelihood=-81641.85792288609  0.975682788299586
  9:  ... loglikelihood=-78164.62963136223  0.9762594456746667
 10:  ... loglikelihood=-75386.40867917785  0.9767044747358703
 11:  ... loglikelihood=-73106.85371375803  0.9770590652957025
 12:  ... loglikelihood=-71196.60721959372  0.9774718588514668
 13:  ... loglikelihood=-69568.23683712543  0.9777279520335679
 14:  ... loglikelihood=-68160.39924327709  0.9779374828189233
 15:  ... loglikelihood=-66928.70260893498  0.9780914969004666
 16:  ... loglikelihood=-65840.17418566217  0.9782661058882628
 17:  ... loglikelihood=-64869.77222395241  0.9784040022170865
 18:  ... loglikelihood=-63998.109674075415 0.9785159310554173
 19:  ... loglikelihood=-63209.92394252923  0.9786475593692944
 20:  ... loglikelihood=-62493.02131098982  0.9787505339005589
 21:  ... loglikelihood=-61837.53211219312  0.9788597764467698
 22:  ... loglikelihood=-61235.37451190329  0.9789457377946079
 23:  ... loglikelihood=-60679.86146007204  0.9790003590677133
 24:  ... loglikelihood=-60165.407875448924 0.979062143786472
 25:  ... loglikelihood=-59687.30928567587  0.9791346736737104
 26:  ... loglikelihood=-59241.572255584455 0.979201830976709
 27:  ... loglikelihood=-58824.78291785096  0.9792698837104141
 28:  ... loglikelihood=-58434.00392167818  0.979333459290586
 29:  ... loglikelihood=-58066.69284046825  0.979381812548745
 30:  ... loglikelihood=-57720.63696783972  0.9794355383911438
 31:  ... loglikelihood=-57393.9007602091   0.9795089637090889
 32:  ... loglikelihood=-57084.78313293037  0.9795483626601814
 33:  ... loglikelihood=-56791.78250307578  0.9795743301506741
 34:  ... loglikelihood=-56513.567973701254 0.9796298468544863
 35:  ... loglikelihood=-56248.955425711436 0.9796808864047651
 36:  ... loglikelihood=-55996.887560355084 0.9797202853558576
 37:  ... loglikelihood=-55756.41714443519  0.9797543117227102
 38:  ... loglikelihood=-55526.69286884015  0.9797963969659226
 39:  ... loglikelihood=-55306.94735282102  0.9798152010107621
 40:  ... loglikelihood=-55096.48692031122  0.9798563908232679
 41:  ... loglikelihood=-54894.68284780714  0.9799029532200136
 42:  ... loglikelihood=-54700.963840494    0.9799378750175728
 43:  ... loglikelihood=-54514.80953871555  0.9799656333694788
 44:  ... loglikelihood=-54335.744892614406 0.9800005551670381
 45:  ... loglikelihood=-54163.33527156895  0.9800301043803574
 46:  ... loglikelihood=-53997.182198154995 0.9800551764401436
 47:  ... loglikelihood=-53836.91961491415  0.980082039361343
 48:  ... loglikelihood=-53682.210607423985 0.980112484005369
 49:  ... loglikelihood=-53532.74451955152  0.980140242357275
 50:  ... loglikelihood=-53388.23440690913  0.9801688961398878
 51:  ... loglikelihood=-53248.41478285541  0.9801921773382606
 52:  ... loglikelihood=-53113.03961847529  0.9802109813831001
 53:  ... loglikelihood=-52981.880563479055 0.9802351580121796
 54:  ... loglikelihood=-52854.7253600851   0.9802584392105524
 55:  ... loglikelihood=-52731.37642565477  0.9802727661018589
 56:  ... loglikelihood=-52611.64958353087  0.9803005244537649
 57:  ... loglikelihood=-52495.37292415569  0.9803148513450712
 58:  ... loglikelihood=-52382.38578113555  0.9803470868505105
 59:  ... loglikelihood=-52272.53780883427  0.9803748452024166
 60:  ... loglikelihood=-52165.68814994865  0.9803891720937229
 61:  ... loglikelihood=-52061.7046829472   0.9804043944157359
 62:  ... loglikelihood=-51960.46334051503  0.9804151395842157
 63:  ... loglikelihood=-51861.84749132724  0.9804393162132952
 64:  ... loglikelihood=-51765.74737831825  0.9804491659510683
 65:  ... loglikelihood=-51672.05960757943  0.9804634928423747
 66:  ... loglikelihood=-51580.686682513515 0.9804876694714542
 67:  ... loglikelihood=-51491.53657871175  0.9805046826548804
 68:  ... loglikelihood=-51404.52235540815  0.9805172186847735
 69:  ... loglikelihood=-51319.56179989248  0.9805315455760798
 70:  ... loglikelihood=-51236.577101627925 0.9805440816059728
 71:  ... loglikelihood=-51155.494553260556 0.9805584084972793
 72:  ... loglikelihood=-51076.24427590388  0.980569153665759
 73:  ... loglikelihood=-50998.75996642977  0.9805825851263587
 74:  ... loglikelihood=-50922.97866477339  0.9805951211562518
 75:  ... loglikelihood=-50848.84053937224  0.9806112389089714
 76:  ... loglikelihood=-50776.28868909037  0.9806264612309844
 77:  ... loglikelihood=-50705.2689602481   0.9806389972608774
 78:  ... loglikelihood=-50635.729777298875 0.9806470561372372
 79:  ... loglikelihood=-50567.62198610024  0.9806658601820769
 80:  ... loglikelihood=-50500.8987085974   0.9806685464741968
 81:  ... loglikelihood=-50435.51520800019  0.9806775007812633
 82:  ... loglikelihood=-50371.42876358994  0.9806837687962098
 83:  ... loglikelihood=-50308.59855431275  0.9806918276725697
 84:  ... loglikelihood=-50246.98555046764  0.9806989911182228
 85:  ... loglikelihood=-50186.55241287111  0.980703468271756
 86:  ... loglikelihood=-50127.26339882067  0.9807195860244757
 87:  ... loglikelihood=-50069.08427441567  0.9807312266236621
 88:  ... loglikelihood=-50011.9822326526   0.9807357037771953
 89:  ... loglikelihood=-49955.92581691934  0.9807446580842618
 90:  ... loglikelihood=-49900.88484943885  0.9807527169606216
 91:  ... loglikelihood=-49846.83036430355  0.9807634621291014
 92:  ... loglikelihood=-49793.734544757914 0.9807724164361679
 93:  ... loglikelihood=-49741.57066440427  0.9807786844511144
 94:  ... loglikelihood=-49690.31303207665  0.9807840570353543
 95:  ... loglikelihood=-49639.93694007888  0.9807948022038341
 96:  ... loglikelihood=-49590.418615580194 0.9808001747880739
 97:  ... loglikelihood=-49541.73517492774  0.9808073382337271
 98:  ... loglikelihood=-49493.86458067577  0.9808145016793803
 99:  ... loglikelihood=-49446.785601155134 0.9808234559864467
100:  ... loglikelihood=-49400.477772387036 0.9808359920163399
    model generated
        model building complete.... 
        annotated sentences: 20370
    Performing NER with new model


It will do this for each iteration until you see:
......
 97:  ... loglikelihood=-49140.50129715517  0.9808462362240823
 98:  ... loglikelihood=-49095.42289306763  0.9808641444693966
 99:  ... loglikelihood=-49051.095083380205 0.9808713077675223
100:  ... loglikelihood=-49007.49834809576  0.9808748894165852
    model generated

You can change the number of iterations (the final argument to generateModel, 3 in the code above) once you see that the annotated sentences and the knowns stop changing on subsequent runs as you refine the lists.
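If you save the console output, you can check for that plateau without reading the log by eye. Here is a rough sketch, assuming the log contains one "Building Model using N annotations" line per iteration as in the output above. The class AnnotationPlateau is hypothetical, and treating "the last two counts are equal" as stable is my own simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading saved console output from the model builder.
final class AnnotationPlateau {

  private static final Pattern ANNOTATIONS =
      Pattern.compile("Building Model using (\\d+) annotations");

  // Collects the annotation count reported at the start of each iteration.
  static List<Integer> annotationCounts(String log) {
    List<Integer> counts = new ArrayList<>();
    Matcher matcher = ANNOTATIONS.matcher(log);
    while (matcher.find()) {
      counts.add(Integer.parseInt(matcher.group(1)));
    }
    return counts;
  }

  // True when at least two iterations were seen and the last two counts match.
  static boolean hasStabilized(List<Integer> counts) {
    int n = counts.size();
    return n >= 2 && counts.get(n - 1).equals(counts.get(n - 2));
  }

  public static void main(String[] args) {
    // Two lines taken from the console output shown above.
    String log = "    Building Model using 7343 annotations\n"
        + "    Building Model using 20370 annotations\n";
    List<Integer> counts = annotationCounts(log);
    System.out.println(counts + " stabilized=" + hasStabilized(counts));
  }
}
```

The same pattern could be pointed at the "knowns:" lines, but those are printed more than once per iteration in the output above, so the counts would need to be de-duplicated first.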

HTH

answered Sep 22 '22 12:09 by Mark Giaconia
