Please see the tutorial steps for OpenNLP Named Entity Recognition. I am using the "en-ner-person.bin" model. The tutorial includes instructions for training and creating a new model. Is there any way to "update" the existing "en-ner-person.bin" with additional training data?
Say I have a list of 500 additional person names that are otherwise not recognized as persons - how do I generate a new model?
Sorry it took me a while to put together a decent code example. The code below reads in your sentences and uses the default en-ner-person model to do its best on them. It then writes the results to two files: one for the good hits (the known entities) and one for the bad hits (the blacklist). Those files are then fed into the "modelbuilder-addon" call at the bottom.
To get the best results, run the class as is, then go into the known entities file and the blacklist file and add and remove names. In other words: add names that it did not find at all, but that you are aware of, to the knowns file, and remove bad names from it. Remove good names from the blacklist file and add them to the knowns file. Then run the model builder part again, without the first part that reads in all your data. It's OK to have duplicates in the knowns and blacklist files. If you have questions, let me know; it's a bit complicated.
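For the 500 extra names in the question, the manual editing step could be scripted. The following is only a minimal sketch using the plain Java standard library, assuming both files hold one name per line. The class name `EntityListEditor` and its methods are hypothetical helpers for illustration and are not part of OpenNLP or the modelbuilder-addon.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP: edits the knowns and blacklist files.
final class EntityListEditor {

    // Reads trimmed, non-blank lines in file order, dropping duplicates.
    // Returns an empty set if the file does not exist yet.
    static Set<String> readNames(Path file) throws IOException {
        Set<String> names = new LinkedHashSet<>();
        if (Files.exists(file)) {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String name = line.trim();
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    // Adds each extra name to the knowns file and removes it from the blacklist file.
    static void promote(Path knowns, Path blacklist, List<String> extraNames) throws IOException {
        Set<String> known = readNames(knowns);
        Set<String> black = readNames(blacklist);
        for (String raw : extraNames) {
            String name = raw.trim();
            if (!name.isEmpty()) {
                known.add(name);
                black.remove(name);
            }
        }
        // Rewrite both files, one name per line.
        Files.write(knowns, known, StandardCharsets.UTF_8);
        Files.write(blacklist, black, StandardCharsets.UTF_8);
    }
}
```

Calling `promote` with the paths of the known entities file and the blacklist file, plus the list of extra names, would leave both files ready for the model builder step. Duplicates are dropped here only for tidiness, since they are allowed anyway.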
import java.io.File;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import opennlp.addons.modelbuilder.DefaultModelBuilderUtil;
import opennlp.tools.entitylinker.EntityLinkerProperties;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class ModelBuilderAddonUse {

    // Fill this method in with however you are going to get your data into a list of
    // sentences. For me, I am hitting a MySQL database (DocProvider, not shown here,
    // is the helper that does that).
    private static List<String> getSentencesFromSomewhere() throws Exception {
        List<String> sentences = new ArrayList<>();
        int counter = 0;
        DocProvider dp = new DocProvider();
        String modelPath = "c:\\apache\\entitylinker\\";
        EntityLinkerProperties properties = new EntityLinkerProperties(new File(modelPath + "entitylinker.properties"));
        Map<Long, List<String>> docs = dp.getDocs(properties);
        for (Map.Entry<Long, List<String>> doc : docs.entrySet()) {
            System.out.println("\t\tDOC: " + doc.getKey() + "\n\n");
            sentences.addAll(doc.getValue());
            // count each document once and stop after 1000 of them
            counter++;
            if (counter > 1000) {
                break;
            }
        }
        return sentences;
    }

    public static void main(String[] args) throws Exception {
        // establish a file to put sentences in
        File sentences = new File("C:\\temp\\modelbuilder\\sentences.text");
        // establish a file to put your NER hits in (the ones you want to keep based on prob)
        File knownEntities = new File("C:\\temp\\modelbuilder\\knownentities.txt");
        // establish a BLACKLIST file to put your bad NER hits in (also can be based on prob)
        File blacklistedentities = new File("C:\\temp\\modelbuilder\\blentities.txt");
        // establish a file to write your annotated sentences to
        File annotatedSentences = new File("C:\\temp\\modelbuilder\\annotatedSentences.txt");
        // establish a file to write your model to
        File theModel = new File("C:\\temp\\modelbuilder\\theModel");

        // create a bunch of file writers (in append mode) to write your results and sentences to a file
        FileWriter sentenceWriter = new FileWriter(sentences, true);
        FileWriter blacklistWriter = new FileWriter(blacklistedentities, true);
        FileWriter knownEntityWriter = new FileWriter(knownEntities, true);

        // set some thresholds to decide where to write hits, you don't have to use these at all...
        double keeperThresh = .95;
        double blacklistThresh = .7;

        // load your model as normal
        TokenNameFinderModel personModel = new TokenNameFinderModel(new File("c:\\temp\\opennlpmodels\\en-ner-person.zip"));
        NameFinderME personFinder = new NameFinderME(personModel);

        // do your normal NER on the sentences you have
        for (String s : getSentencesFromSomewhere()) {
            sentenceWriter.write(s.trim() + "\n");
            sentenceWriter.flush();
            String[] tokens = s.split(" "); // better to use a tokenizer really
            Span[] find = personFinder.find(tokens);
            // probs(find) returns one probability per span; the no-arg probs() is per token
            double[] probs = personFinder.probs(find);
            String[] names = Span.spansToStrings(find, tokens);
            for (int i = 0; i < names.length; i++) {
                // YOU PROBABLY HAVE BETTER HEURISTICS THAN THIS TO MAKE SURE YOU GET GOOD HITS OUT OF THE DEFAULT MODEL
                if (probs[i] > keeperThresh) {
                    knownEntityWriter.write(names[i].trim() + "\n");
                }
                if (probs[i] < blacklistThresh) {
                    blacklistWriter.write(names[i].trim() + "\n");
                }
                // hits between the two thresholds are written to neither file
            }
            personFinder.clearAdaptiveData();
            blacklistWriter.flush();
            knownEntityWriter.flush();
        }

        // close all the writers (close() also flushes)
        knownEntityWriter.close();
        sentenceWriter.close();
        blacklistWriter.close();

        // THIS IS WHERE THE ADDON IS GOING TO USE THE FILES (AS IS) TO CREATE A NEW MODEL.
        // YOU SHOULD NOT HAVE TO RUN THE FIRST PART AGAIN AFTER THIS RUNS. JUST PLAY WITH THE
        // KNOWN ENTITIES AND BLACKLIST FILES AND RUN THE METHOD BELOW AGAIN UNTIL YOU GET
        // SOME DECENT RESULTS (A DECENT MODEL OUT OF IT).
        DefaultModelBuilderUtil.generateModel(sentences, knownEntities, blacklistedentities,
                theModel, annotatedSentences, "person", 3);
    }
}
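The two thresholds in the loop above split the hits three ways, which is easy to miss when reading the `if` statements inline. As a rough illustration only, the same routing can be pulled out into a small standalone function. `HitRouter` and `Bucket` are hypothetical names introduced here, and the values in the usage note are the `.95` and `.7` from the code above.

```java
// Hypothetical helper that isolates the threshold heuristic; not part of OpenNLP.
final class HitRouter {

    enum Bucket { KNOWN, BLACKLIST, UNDECIDED }

    // Mirrors the two if-statements in the loop: strictly above the keeper threshold
    // goes to the knowns file, strictly below the blacklist threshold goes to the
    // blacklist file, and anything in between is written to neither file.
    static Bucket route(double prob, double keeperThresh, double blacklistThresh) {
        if (prob > keeperThresh) {
            return Bucket.KNOWN;
        }
        if (prob < blacklistThresh) {
            return Bucket.BLACKLIST;
        }
        return Bucket.UNDECIDED;
    }
}
```

With the thresholds used above, `HitRouter.route(0.8, .95, .7)` falls in the undecided band, so that name would need to be added to one of the files by hand if you want it to count.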
This is what the console should look like (I removed some lines for brevity here):
ITERATION: 0
Perfoming Known Entity Annotation
knowns: 625
reading data....:
writing annotated sentences....:
building model....
Building Model using 7343 annotations
reading training data...
Indexing events using cutoff of 5
Computing event counts... done. 561755 events
Indexing... done.
Sorting and merging events... done. Reduced 561755 events to 127362.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 127362
Number of Outcomes: 3
Number of Predicates: 106490
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-617150.9462211537 0.015709695507828147
2: ... loglikelihood=-90520.86903515142 0.9771288195031642
3: ... loglikelihood=-56901.86905339755 0.9771288195031642
4: ... loglikelihood=-44231.80460317638 0.9773086131854634
5: ... loglikelihood=-37222.56576767385 0.9787985865724381
6: ... loglikelihood=-32900.5623814595 0.9801924326441243
7: ... loglikelihood=-29992.881445391187 0.9829747843810914
8: ... loglikelihood=-27893.341149419102 0.9836423351817073
9: ... loglikelihood=-26296.107313900917 0.9845092611547739
10: ... loglikelihood=-25033.501573153182 0.9850682236918229
11: ... loglikelihood=-24006.060636903556 0.9856182855515305
12: ... loglikelihood=-23150.856525607975 0.9859084476328649
13: ... loglikelihood=-22425.987337392176 0.9861897090368577
14: ... loglikelihood=-21802.386362016423 0.9864211266477378
15: ... loglikelihood=-21259.20580401235 0.9865208142339632
16: ... loglikelihood=-20781.0716762281 0.9867362106256287
17: ... loglikelihood=-20356.37732369309 0.986905323495118
18: ... loglikelihood=-19976.18228587008 0.9870673158227341
19: ... loglikelihood=-19633.47877575036 0.9872097266601988
20: ... loglikelihood=-19322.689448146353 0.9873165347882974
21: ... loglikelihood=-19039.31522510173 0.9874073216971812
22: ... loglikelihood=-18779.683112448918 0.9875176900962164
23: ... loglikelihood=-18540.76222439295 0.9876316187661881
24: ... loglikelihood=-18320.027315327916 0.9877081645913254
25: ... loglikelihood=-18115.35602743375 0.9877918309583359
26: ... loglikelihood=-17924.95047403401 0.9878612562416
27: ... loglikelihood=-17747.27665623459 0.9879378020667373
28: ... loglikelihood=-17581.01712643139 0.9879947664017231
29: ... loglikelihood=-17425.03361369085 0.9880784327687337
30: ... loglikelihood=-17278.3372262906 0.9881282765618463
31: ... loglikelihood=-17140.06447937828 0.9882012621160471
32: ... loglikelihood=-17009.45784626013 0.9882546661800963
33: ... loglikelihood=-16885.84985637711 0.9883187510569554
34: ... loglikelihood=-16768.64999916476 0.9883703749855364
35: ... loglikelihood=-16657.3338665414 0.9884166585077124
36: ... loglikelihood=-16551.434095577726 0.9884558214880153
37: ... loglikelihood=-16450.532769374073 0.9885074454165962
38: ... loglikelihood=-16354.255007222264 0.9885448282614306
39: ... loglikelihood=-16262.263530858221 0.9885733104289236
40: ... loglikelihood=-16174.254036589966 0.9886391754412511
41: ... loglikelihood=-16089.951236435176 0.9886765582860856
42: ... loglikelihood=-16009.105457548561 0.9887281822146665
43: ... loglikelihood=-15931.489709807445 0.988747763704818
44: ... loglikelihood=-15856.897147780543 0.9887798061432475
45: ... loglikelihood=-15785.138866385483 0.9888065081752722
46: ... loglikelihood=-15716.041980029182 0.9888349903427651
47: ... loglikelihood=-15649.447943527766 0.9888581321038531
48: ... loglikelihood=-15585.211079986258 0.9888901745422827
49: ... loglikelihood=-15523.19728647256 0.9889328977935221
50: ... loglikelihood=-15463.282892914636 0.9889595998255467
51: ... loglikelihood=-15405.353653492159 0.9889685005028883
52: ... loglikelihood=-15349.303852923775 0.9889809614511664
53: ... loglikelihood=-15295.035512678789 0.9889934223994445
54: ... loglikelihood=-15242.457684348112 0.989013003889596
55: ... loglikelihood=-15191.485819217298 0.9890236847024059
56: ... loglikelihood=-15142.041204645499 0.9890397059216206
57: ... loglikelihood=-15094.050459152337 0.9890539470053671
58: ... loglikelihood=-15047.445079207273 0.9890592874117721
59: ... loglikelihood=-15002.161031666768 0.9890753086309868
60: ... loglikelihood=-14958.13838658306 0.9890966702566065
61: ... loglikelihood=-14915.320985817205 0.9891180318822262
62: ... loglikelihood=-14873.656143433394 0.9891269325595677
63: ... loglikelihood=-14833.094374397517 0.9891500743206558
64: ... loglikelihood=-14793.589148498404 0.9891589749979973
65: ... loglikelihood=-14755.096666806796 0.9891785564881488
66: ... loglikelihood=-14717.5756582924 0.9891892373009586
67: ... loglikelihood=-14680.98719451864 0.9891892373009586
68: ... loglikelihood=-14645.294520562966 0.9891945777073635
69: ... loglikelihood=-14610.462900520715 0.9891999181137685
70: ... loglikelihood=-14576.45947616036 0.989214159197515
71: ... loglikelihood=-14543.25313742511 0.9892212797393881
72: ... loglikelihood=-14510.814403643026 0.9892230598748565
73: ... loglikelihood=-14479.115314429962 0.9892230598748565
74: ... loglikelihood=-14448.129329357815 0.9892426413650078
75: ... loglikelihood=-14417.831235594616 0.9892515420423494
76: ... loglikelihood=-14388.19706276905 0.9892622228551593
77: ... loglikelihood=-14359.204004414 0.9892711235325008
78: ... loglikelihood=-14330.8303454032 0.9892764639389058
79: ... loglikelihood=-14303.055394843146 0.9892764639389058
80: ... loglikelihood=-14275.859423957678 0.9892924851581205
81: ... loglikelihood=-14249.223608524193 0.9893013858354621
82: ... loglikelihood=-14223.129975482772 0.9893209673256135
83: ... loglikelihood=-14197.561353359844 0.9893263077320185
84: ... loglikelihood=-14172.50132620183 0.9893280878674867
85: ... loglikelihood=-14147.934190713178 0.9893263077320185
86: ... loglikelihood=-14123.84491635766 0.9893316481384233
87: ... loglikelihood=-14100.21910816809 0.9894313357246487
88: ... loglikelihood=-14077.042972066316 0.989433115860117
89: ... loglikelihood=-14054.303282478262 0.9894437966729268
90: ... loglikelihood=-14031.987352086799 0.9894580377566733
91: ... loglikelihood=-14010.083003539214 0.9894615980276099
92: ... loglikelihood=-13988.578542971209 0.9894776192468246
93: ... loglikelihood=-13967.46273521311 0.9894811795177613
94: ... loglikelihood=-13946.724780546094 0.9894829596532296
95: ... loglikelihood=-13926.354292898612 0.9894829596532296
96: ... loglikelihood=-13906.341279379953 0.9894900801951029
97: ... loglikelihood=-13886.676121050288 0.9894936404660395
98: ... loglikelihood=-13867.34955484593 0.9894954206015077
99: ... loglikelihood=-13848.35265657199 0.9894954206015077
100: ... loglikelihood=-13829.676824889664 0.9894972007369761
model generated
model building complete....
annotated sentences: 7343
Performing NER with new model
Printing NER Results. Add undesired results to the blacklist file and start over
//prints some names
annotated sentences: 7369
knowns: 651
ITERATION: 1
Perfoming Known Entity Annotation
knowns: 651
reading data....:
writing annotated sentences....:
building model....
Building Model using 20370 annotations
reading training data...
Indexing events using cutoff of 5
Computing event counts... done. 1116781 events
Indexing... done.
Sorting and merging events... done. Reduced 1116781 events to 288251.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 288251
Number of Outcomes: 3
Number of Predicates: 206399
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-1226909.3303549637 0.03418485808766446
2: ... loglikelihood=-196688.7107544095 0.9622047653031346
3: ... loglikelihood=-138615.22912914792 0.9651462551744702
4: ... loglikelihood=-114777.09879832959 0.9697075791941303
5: ... loglikelihood=-101055.0229949508 0.9716443958126079
6: ... loglikelihood=-92253.8923255943 0.973049326591337
7: ... loglikelihood=-86146.35307405592 0.9750121107003074
8: ... loglikelihood=-81641.85792288609 0.975682788299586
9: ... loglikelihood=-78164.62963136223 0.9762594456746667
10: ... loglikelihood=-75386.40867917785 0.9767044747358703
11: ... loglikelihood=-73106.85371375803 0.9770590652957025
12: ... loglikelihood=-71196.60721959372 0.9774718588514668
13: ... loglikelihood=-69568.23683712543 0.9777279520335679
14: ... loglikelihood=-68160.39924327709 0.9779374828189233
15: ... loglikelihood=-66928.70260893498 0.9780914969004666
16: ... loglikelihood=-65840.17418566217 0.9782661058882628
17: ... loglikelihood=-64869.77222395241 0.9784040022170865
18: ... loglikelihood=-63998.109674075415 0.9785159310554173
19: ... loglikelihood=-63209.92394252923 0.9786475593692944
20: ... loglikelihood=-62493.02131098982 0.9787505339005589
21: ... loglikelihood=-61837.53211219312 0.9788597764467698
22: ... loglikelihood=-61235.37451190329 0.9789457377946079
23: ... loglikelihood=-60679.86146007204 0.9790003590677133
24: ... loglikelihood=-60165.407875448924 0.979062143786472
25: ... loglikelihood=-59687.30928567587 0.9791346736737104
26: ... loglikelihood=-59241.572255584455 0.979201830976709
27: ... loglikelihood=-58824.78291785096 0.9792698837104141
28: ... loglikelihood=-58434.00392167818 0.979333459290586
29: ... loglikelihood=-58066.69284046825 0.979381812548745
30: ... loglikelihood=-57720.63696783972 0.9794355383911438
31: ... loglikelihood=-57393.9007602091 0.9795089637090889
32: ... loglikelihood=-57084.78313293037 0.9795483626601814
33: ... loglikelihood=-56791.78250307578 0.9795743301506741
34: ... loglikelihood=-56513.567973701254 0.9796298468544863
35: ... loglikelihood=-56248.955425711436 0.9796808864047651
36: ... loglikelihood=-55996.887560355084 0.9797202853558576
37: ... loglikelihood=-55756.41714443519 0.9797543117227102
38: ... loglikelihood=-55526.69286884015 0.9797963969659226
39: ... loglikelihood=-55306.94735282102 0.9798152010107621
40: ... loglikelihood=-55096.48692031122 0.9798563908232679
41: ... loglikelihood=-54894.68284780714 0.9799029532200136
42: ... loglikelihood=-54700.963840494 0.9799378750175728
43: ... loglikelihood=-54514.80953871555 0.9799656333694788
44: ... loglikelihood=-54335.744892614406 0.9800005551670381
45: ... loglikelihood=-54163.33527156895 0.9800301043803574
46: ... loglikelihood=-53997.182198154995 0.9800551764401436
47: ... loglikelihood=-53836.91961491415 0.980082039361343
48: ... loglikelihood=-53682.210607423985 0.980112484005369
49: ... loglikelihood=-53532.74451955152 0.980140242357275
50: ... loglikelihood=-53388.23440690913 0.9801688961398878
51: ... loglikelihood=-53248.41478285541 0.9801921773382606
52: ... loglikelihood=-53113.03961847529 0.9802109813831001
53: ... loglikelihood=-52981.880563479055 0.9802351580121796
54: ... loglikelihood=-52854.7253600851 0.9802584392105524
55: ... loglikelihood=-52731.37642565477 0.9802727661018589
56: ... loglikelihood=-52611.64958353087 0.9803005244537649
57: ... loglikelihood=-52495.37292415569 0.9803148513450712
58: ... loglikelihood=-52382.38578113555 0.9803470868505105
59: ... loglikelihood=-52272.53780883427 0.9803748452024166
60: ... loglikelihood=-52165.68814994865 0.9803891720937229
61: ... loglikelihood=-52061.7046829472 0.9804043944157359
62: ... loglikelihood=-51960.46334051503 0.9804151395842157
63: ... loglikelihood=-51861.84749132724 0.9804393162132952
64: ... loglikelihood=-51765.74737831825 0.9804491659510683
65: ... loglikelihood=-51672.05960757943 0.9804634928423747
66: ... loglikelihood=-51580.686682513515 0.9804876694714542
67: ... loglikelihood=-51491.53657871175 0.9805046826548804
68: ... loglikelihood=-51404.52235540815 0.9805172186847735
69: ... loglikelihood=-51319.56179989248 0.9805315455760798
70: ... loglikelihood=-51236.577101627925 0.9805440816059728
71: ... loglikelihood=-51155.494553260556 0.9805584084972793
72: ... loglikelihood=-51076.24427590388 0.980569153665759
73: ... loglikelihood=-50998.75996642977 0.9805825851263587
74: ... loglikelihood=-50922.97866477339 0.9805951211562518
75: ... loglikelihood=-50848.84053937224 0.9806112389089714
76: ... loglikelihood=-50776.28868909037 0.9806264612309844
77: ... loglikelihood=-50705.2689602481 0.9806389972608774
78: ... loglikelihood=-50635.729777298875 0.9806470561372372
79: ... loglikelihood=-50567.62198610024 0.9806658601820769
80: ... loglikelihood=-50500.8987085974 0.9806685464741968
81: ... loglikelihood=-50435.51520800019 0.9806775007812633
82: ... loglikelihood=-50371.42876358994 0.9806837687962098
83: ... loglikelihood=-50308.59855431275 0.9806918276725697
84: ... loglikelihood=-50246.98555046764 0.9806989911182228
85: ... loglikelihood=-50186.55241287111 0.980703468271756
86: ... loglikelihood=-50127.26339882067 0.9807195860244757
87: ... loglikelihood=-50069.08427441567 0.9807312266236621
88: ... loglikelihood=-50011.9822326526 0.9807357037771953
89: ... loglikelihood=-49955.92581691934 0.9807446580842618
90: ... loglikelihood=-49900.88484943885 0.9807527169606216
91: ... loglikelihood=-49846.83036430355 0.9807634621291014
92: ... loglikelihood=-49793.734544757914 0.9807724164361679
93: ... loglikelihood=-49741.57066440427 0.9807786844511144
94: ... loglikelihood=-49690.31303207665 0.9807840570353543
95: ... loglikelihood=-49639.93694007888 0.9807948022038341
96: ... loglikelihood=-49590.418615580194 0.9808001747880739
97: ... loglikelihood=-49541.73517492774 0.9808073382337271
98: ... loglikelihood=-49493.86458067577 0.9808145016793803
99: ... loglikelihood=-49446.785601155134 0.9808234559864467
100: ... loglikelihood=-49400.477772387036 0.9808359920163399
model generated
model building complete....
annotated sentences: 20370
Performing NER with new model
It will do this for each iteration until you see:
......
97: ... loglikelihood=-49140.50129715517 0.9808462362240823
98: ... loglikelihood=-49095.42289306763 0.9808641444693966
99: ... loglikelihood=-49051.095083380205 0.9808713077675223
100: ... loglikelihood=-49007.49834809576 0.9808748894165852
model generated
You can change the number of iterations if you see that the annotated sentences and the knowns stop changing on subsequent runs as you refine the lists.
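One way to spot that the knowns have stopped changing is to read the counts back out of the console output. The sketch below assumes the log keeps the `ITERATION: n` and `knowns: n` line format shown above, and it records only the first `knowns:` line after each `ITERATION:` line. `ConvergenceCheck` is a hypothetical helper and is not part of the addon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts per-iteration "knowns" counts from the console log.
final class ConvergenceCheck {

    private static final Pattern ITERATION = Pattern.compile("^ITERATION:\\s*\\d+$");
    private static final Pattern KNOWNS = Pattern.compile("^knowns:\\s*(\\d+)$");

    // Returns the knowns count reported at the start of each iteration, in order.
    static List<Integer> knownsPerIteration(List<String> consoleLines) {
        List<Integer> counts = new ArrayList<>();
        boolean awaitingCount = false;
        for (String rawLine : consoleLines) {
            String line = rawLine.trim();
            if (ITERATION.matcher(line).matches()) {
                awaitingCount = true;
                continue;
            }
            Matcher m = KNOWNS.matcher(line);
            if (awaitingCount && m.matches()) {
                counts.add(Integer.parseInt(m.group(1)));
                awaitingCount = false;
            }
        }
        return counts;
    }

    // True when the last two iterations started with the same number of knowns.
    static boolean hasStabilized(List<Integer> counts) {
        int n = counts.size();
        return n >= 2 && counts.get(n - 1).equals(counts.get(n - 2));
    }
}
```

For the two iterations shown above this yields 625 and then 651, so the knowns were still growing at that point. An unchanged count is only a hint, because the blacklist and the annotated sentences can still change.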
HTH