 

Having both NER and RegexNER tags in StanfordCoreNLPServer output?

I am using the StanfordCoreNLPServer to extract some information from text (such as surface areas and street names).

The street is given by a specifically trained NER model, and the surface by a simple regex via RegexNER.

Each of them works fine separately, but when they are used together, only the NER result is present in the output, under the ner tag. Why isn't there a regexner tag? Is there a way to also get the RegexNER result?

For information:

  • StanfordCoreNLP v3.6.0

  • the URL used:

    'http://127.0.0.1:9000/'
    '?properties={"annotators":"tokenize,ssplit,pos,ner,regexner", '
    '"pos.model":"edu/stanford/nlp/models/pos-tagger/french/french.tagger",'
    '"tokenize.language":"fr",'
    '"ner.model":"ner-model.ser.gz", ' # custom NER model with STREET labels
    '"regexner.mapping":"rules.tsv", ' # SURFACE label
    '"outputFormat": "json"}'
    

    As suggested here, the regexner annotator comes after ner, but the RegexNER result is still missing.

  • The current output (extract):

    {u'index': 4, u'word': u'dans', u'lemma': u'dans', u'pos': u'P', u'characterOffsetEnd': 12, u'characterOffsetBegin': 8, u'originalText': u'dans', u'ner': u'O'}
    {u'index': 5, u'word': u'la', u'lemma': u'la', u'pos': u'DET', u'characterOffsetEnd': 15, u'characterOffsetBegin': 13, u'originalText': u'la', u'ner': u'O'}
    {u'index': 6, u'word': u'rue', u'lemma': u'rue', u'pos': u'NC', u'characterOffsetEnd': 19, u'characterOffsetBegin': 16, u'originalText': u'rue', u'ner': u'STREET'}
    {u'index': 7, u'word': u'du', u'lemma': u'du', u'pos': u'P', u'characterOffsetEnd': 22, u'characterOffsetBegin': 20, u'originalText': u'du', u'ner': u'STREET'}
    [...]
    {u'index': 43, u'word': u'165', u'lemma': u'165', u'normalizedNER': u'165.0', u'pos': u'DET', u'characterOffsetEnd': 196, u'characterOffsetBegin': 193, u'originalText': u'165', u'ner': u'NUMBER'}
    {u'index': 44, u'word': u'm', u'lemma': u'm', u'pos': u'NC', u'characterOffsetEnd': 198, u'characterOffsetBegin': 197, u'originalText': u'm', u'ner': u'O'}
    {u'index': 45, u'word': u'2', u'lemma': u'2', u'normalizedNER': u'2.0', u'pos': u'ADJ', u'characterOffsetEnd': 199, u'characterOffsetBegin': 198, u'originalText': u'2', u'ner': u'NUMBER'}
    
  • Expected output: I would like the last 3 items to be labelled SURFACE, i.e. the RegexNER result.

Let me know if more details are needed.

stellasia asked Jun 17 '16 13:06


2 Answers

Here's what the RegexNER documentation says about this:

RegexNER will not overwrite an existing entity assignment, unless you give it permission in a third tab-separated column, which contains a comma-separated list of entity types that can be overwritten. Only the non-entity O label can always be overwritten, but you can specify extra entity tags which can always be overwritten as well.

    Bachelor of (Arts|Laws|Science|Engineering|Divinity)	DEGREE
    Lalor	LOCATION	PERSON
    Labor	ORGANIZATION

I'm not sure exactly what your mapping file looks like, but if it only maps patterns to labels, then the original NER will label your tokens as NUMBER, and RegexNER won't be able to overwrite them. If you explicitly declare in your mapping file that NUMBER entities may be overwritten by SURFACE, then it should work.
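As an illustration of the third column, here is a minimal Python sketch that builds such a mapping line. The helper name `make_rule` and the pattern `[0-9]+ m 2` are assumptions made up for this example (the actual contents of rules.tsv are not shown in the question); the pattern is only meant to match the three tokens `165`, `m`, `2` from the sample output.

```python
def make_rule(pattern, label, overwritable=()):
    """Return one tab-separated RegexNER mapping line.

    pattern      -- space-separated token regexes (hypothetical example below)
    label        -- entity type to assign, e.g. SURFACE
    overwritable -- entity types RegexNER is allowed to overwrite
    """
    columns = [pattern, label]
    if overwritable:
        # Third column: comma-separated list of overwritable entity types.
        columns.append(",".join(overwritable))
    return "\t".join(columns)


# Hypothetical rule: a number followed by the tokens "m" and "2".
rule = make_rule("[0-9]+ m 2", "SURFACE", overwritable=["NUMBER"])
print(repr(rule))
```

Writing the returned line to rules.tsv (one rule per line) should let RegexNER replace the NUMBER tags on those tokens, provided the pattern really matches how the text is tokenized.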

Emre Colak answered Nov 19 '22 18:11


OK, things seem to work as I want if I put regexner first:

"annotators":"regexner,tokenize,ssplit,pos,ner",

It seems there is an ordering problem at some stage of the process?
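One way to check what the order change actually does is to compare the ner tag of each token for both annotator orders. The following is a minimal sketch, assuming the server's JSON response has already been parsed into a dict with the usual `sentences` / `tokens` layout; `ner_tags` is a hypothetical helper, and the sample is hand-written from the output shown in the question, not a real server response.

```python
def ner_tags(response):
    """Return (word, ner) pairs from a parsed CoreNLP server JSON response."""
    return [
        (token["word"], token.get("ner", "O"))
        for sentence in response.get("sentences", [])
        for token in sentence.get("tokens", [])
    ]


# Hand-written sample shaped like the last three tokens in the question.
sample = {"sentences": [{"tokens": [
    {"index": 43, "word": "165", "ner": "NUMBER"},
    {"index": 44, "word": "m", "ner": "O"},
    {"index": 45, "word": "2", "ner": "NUMBER"},
]}]}

print(ner_tags(sample))
```

Running the same helper on the responses obtained with each "annotators" value shows directly which tokens change from NUMBER or O to SURFACE.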

stellasia answered Nov 19 '22 17:11