I am building a 15k-line training data file called en-ner-person.train, following the online manual (http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html).
My question is: in my training document, do I include an entire report? Or do I only include the lines that contain a name, such as <START:person> John Smith <END>?
So for example do I use this entire report in my training data:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
A nonexecutive director has many similar responsibilities as an executive director.
However, there are no voting rights with this position.
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
Or do I only include these two lines in my training document:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
You should use the entire report. This helps the system learn when not to mark an entity, which reduces false positives and so improves precision.
You can measure this with the evaluation tool. Reserve some sentences of your corpus for testing, for example 1/10 of the total, and train your model on the other 9/10. Train one model on the entire report and another on only the sentences with names, then compare the two. The results are reported as precision and recall.
Remember to draw the test sample from the entire report, not only from the sentences with names; otherwise you will not get an accurate measure of how the model performs on sentences without names.
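The hold-out split and the two scores can be sketched in plain Java. This is only a rough illustration and does not use OpenNLP: the class CorpusSplit and its methods are hypothetical names invented here, and the true/false positive counts would have to come from comparing the model's output with your annotations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: holds out every n-th sentence for testing.
final class CorpusSplit {
    final List<String> train = new ArrayList<>();
    final List<String> test = new ArrayList<>();

    // Sends every foldSize-th sentence to the test set and the rest to training.
    // Sentences without names are kept in both sets on purpose.
    static CorpusSplit holdOut(List<String> sentences, int foldSize) {
        CorpusSplit split = new CorpusSplit();
        for (int i = 0; i < sentences.size(); i++) {
            if (i % foldSize == foldSize - 1) {
                split.test.add(sentences.get(i));
            } else {
                split.train.add(sentences.get(i));
            }
        }
        return split;
    }

    // Precision: share of the names the model predicted that were correct.
    static double precision(int truePositives, int falsePositives) {
        int predicted = truePositives + falsePositives;
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    // Recall: share of the annotated names that the model found.
    static double recall(int truePositives, int falseNegatives) {
        int actual = truePositives + falseNegatives;
        return actual == 0 ? 0.0 : (double) truePositives / actual;
    }

    public static void main(String[] args) {
        List<String> corpus = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            corpus.add("sentence " + i);
        }
        CorpusSplit split = holdOut(corpus, 10);
        System.out.println(split.train.size() + " train, " + split.test.size() + " test");
        System.out.println("precision=" + precision(8, 2) + " recall=" + recall(8, 8));
    }
}
```

A model trained only on sentences with names would be expected to show its weakness mainly in the precision figure, since spurious names on name-free test sentences count as false positives.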
I would include everything, even though not all of it may contribute to the weights in the trained model.
What is or isn't used from the training file is determined by the feature generator used to train the model. If you get to the point where you are actually tweaking the feature generator, then at least you won't need to rebuild your training file if it already includes everything.
This example feature generator from the documentation (the Custom Feature Generation section) also happens to be the default one that the code uses for name finders:
AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
    new AdaptiveFeatureGenerator[]{
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
        new OutcomePriorFeatureGenerator(),
        new PreviousMapFeatureGenerator(),
        new BigramNameFeatureGenerator(),
        new SentenceFeatureGenerator(true, false)
    });
I can't fully explain that blob of code, and I haven't found good documentation on it or waded through the source to understand it, but the WindowFeatureGenerators there take into account the tokens and the classes of the tokens (as far as I can tell, a word-shape class such as capitalized or numeric) up to 2 positions before and after the token being examined. PreviousMapFeatureGenerator appears to be the one that looks at how a token was labeled earlier.
As such, it is possible that tokens in a sentence that doesn't contain an entity may have an impact on a sentence that does. By cropping out the extra sentences you may be training your model on unnatural patterns, such as a sentence ending with a name followed by a sentence that begins with a name, like this:
The car fell on <START:person> Pierre Vinken <END> . <START:person> Pierre Vinken <END> is the chairman .
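To make the +/-2 window idea concrete, here is a minimal sketch in plain Java. It is not OpenNLP's WindowFeatureGenerator and makes no claim about how that class names or encodes its features; the class WindowSketch and the feature labels (w, p1, p2, n1, n2) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: NOT OpenNLP's WindowFeatureGenerator, just the same idea in miniature.
final class WindowSketch {
    // Collects the token at index i plus up to `prev` tokens before it
    // and up to `next` tokens after it, skipping positions outside the array.
    static List<String> windowFeatures(String[] tokens, int i, int prev, int next) {
        List<String> features = new ArrayList<>();
        features.add("w=" + tokens[i]);
        for (int k = 1; k <= prev && i - k >= 0; k++) {
            features.add("p" + k + "=" + tokens[i - k]);
        }
        for (int k = 1; k <= next && i + k < tokens.length; k++) {
            features.add("n" + k + "=" + tokens[i + k]);
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = "Mr . Vinken is chairman of Elsevier".split(" ");
        // Features for "Vinken" (index 2) with a window of 2 on each side.
        System.out.println(windowFeatures(tokens, 2, 2, 2));
    }
}
```

For the token "Vinken" this collects "Mr" and "." on the left and "is" and "chairman" on the right, which is why the tokens surrounding a name matter to the model and not just the name itself.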