
Result Difference in Stanford NER tagger NLTK (python) vs JAVA

I am running the Stanford NER tagger from both Python (through NLTK) and Java, but the two give different results.

For example, when I input the sentence "Involved in all aspects of data modeling using ERwin as the primary software for this.",

JAVA Result:

"ERwin": "PERSON"

Python Result:

In [6]: NERTagger.tag("Involved in all aspects of data modeling using ERwin as the primary software for this.".split())
Out [6]:[(u'Involved', u'O'),
 (u'in', u'O'),
 (u'all', u'O'),
 (u'aspects', u'O'),
 (u'of', u'O'),
 (u'data', u'O'),
 (u'modeling', u'O'),
 (u'using', u'O'),
 (u'ERwin', u'O'),
 (u'as', u'O'),
 (u'the', u'O'),
 (u'primary', u'O'),
 (u'software', u'O'),
 (u'for', u'O'),
 (u'this.', u'O')]

The Python NLTK wrapper does not tag "ERwin" as PERSON.

What's interesting is that both Python and Java use the same trained model (english.all.3class.caseless.distsim.crf.ser.gz), released on 2015-04-20.

My ultimate goal is to make Python behave the same way Java does.

I'm looking at StanfordNERTagger in nltk.tag to see if there's anything I can modify. Below is the wrapper code:

class StanfordNERTagger(StanfordTagger):
    """
    A class for Named-Entity Tagging with Stanford Tagger. The input is the paths to:

    - a model trained on training data
    - (optionally) the path to the stanford tagger jar file. If not specified here,
      then this jar file must be specified in the CLASSPATH environment variable.
    - (optionally) the encoding of the training data (default: UTF-8)

    Example:

        >>> from nltk.tag import StanfordNERTagger
        >>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz') # doctest: +SKIP
        >>> st.tag('Rami Eid is studying at Stony Brook University in NY'.split()) # doctest: +SKIP
        [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
         ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
         ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
    """

    _SEPARATOR = '/'
    _JAR = 'stanford-ner.jar'
    _FORMAT = 'slashTags'

    def __init__(self, *args, **kwargs):
        super(StanfordNERTagger, self).__init__(*args, **kwargs)

    @property
    def _cmd(self):
        # Adding -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer
        # -tokenizerOptions tokenizeNLs=false for not using stanford Tokenizer
        return ['edu.stanford.nlp.ie.crf.CRFClassifier',
                '-loadClassifier', self._stanford_model, '-textFile',
                self._input_file_path, '-outputFormat', self._FORMAT,
                '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer',
                '-tokenizerOptions', '\"tokenizeNLs=false\"']

    def parse_output(self, text, sentences):
        if self._FORMAT == 'slashTags':
            # Join everything together into one big list
            tagged_sentences = []
            for tagged_sentence in text.strip().split("\n"):
                for tagged_word in tagged_sentence.strip().split():
                    word_tags = tagged_word.strip().split(self._SEPARATOR)
                    tagged_sentences.append((''.join(word_tags[:-1]), word_tags[-1]))

            # Separate it according to the input
            result = []
            start = 0
            for sent in sentences:
                result.append(tagged_sentences[start:start + len(sent)])
                start += len(sent)
            return result

        raise NotImplementedError

Or is the difference caused by using a different classifier? The Java code seems to use AbstractSequenceClassifier, whereas the Python NLTK wrapper uses CRFClassifier. If so, is there a way to use AbstractSequenceClassifier in the Python wrapper?

aerin asked Jan 06 '16 05:01


1 Answer

Try setting maxAdditionalKnownLCWords to 0 in the properties file (or on the command line) for CoreNLP, and if possible for NLTK as well. This disables an option that lets the NER system learn a little from test-time data, which can occasionally cause slightly different results.
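For the NLTK side, one way to try this is to add the flag to the command line that the wrapper builds. The sketch below is only an illustration under assumptions: build_ner_command is a hypothetical helper (not part of NLTK), and it assumes CRFClassifier accepts the option on the command line as -maxAdditionalKnownLCWords 0. The other arguments mirror the _cmd property quoted in the question, except that tokenizeNLs=false is passed without the embedded quote characters.

```python
def build_ner_command(model_path, input_path, output_format="slashTags",
                      max_additional_known_lc_words=0):
    """Build the argument list for edu.stanford.nlp.ie.crf.CRFClassifier.

    Hypothetical helper, not part of NLTK. It mirrors the _cmd property from
    the question and appends the option suggested in the answer.
    """
    return [
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-loadClassifier", model_path,
        "-textFile", input_path,
        "-outputFormat", output_format,
        # Keep the whitespace tokenizer so tokens match the pre-split input.
        "-tokenizerFactory", "edu.stanford.nlp.process.WhitespaceTokenizer",
        "-tokenizerOptions", "tokenizeNLs=false",
        # Assumption: the classifier reads this flag from the command line.
        "-maxAdditionalKnownLCWords", str(max_additional_known_lc_words),
    ]


if __name__ == "__main__":
    # Placeholder file names, used here only to show the resulting arguments.
    print(" ".join(build_ner_command("model.ser.gz", "input.txt")))
```

In a subclass of StanfordNERTagger, the _cmd property could then return build_ner_command(self._stanford_model, self._input_file_path, self._FORMAT), so the rest of the wrapper stays unchanged. Whether this makes the two outputs agree still has to be checked against the Java result.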

Gabor Angeli answered Sep 28 '22 01:09