Having both NER and RegexNER tags in StanfordCoreNLPServer output?

Question

I am using the StanfordCoreNLPServer to extract information from text (such as surface areas and street names).

The street names come from a custom-trained NER model, and the surface areas from a simple regex via RegexNER.

Each works fine separately, but when they are used together, only the NER result is present in the output, under the ner tag. Why isn't there a regexner tag? Is there a way to also get the RegexNER result?

For information:

StanfordCoreNLP v3.6.0

The URL used:

'http://127.0.0.1:9000/'
'?properties={"annotators":"tokenize,ssplit,pos,ner,regexner", '
'"pos.model":"edu/stanford/nlp/models/pos-tagger/french/french.tagger",'
'"tokenize.language":"fr",'
'"ner.model":"ner-model.ser.gz", ' # custom NER model with STREET labels
'"regexner.mapping":"rules.tsv", ' # SURFACE label
'"outputFormat": "json"}'

As suggested here, the regexner annotator comes after ner, but the RegexNER labels are still missing.
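For reference, here is a minimal sketch of how such a request URL could be assembled in Python 3 instead of concatenating string literals by hand. The helper name `build_corenlp_url` is hypothetical and not part of any CoreNLP client; it only JSON-encodes the properties shown above and percent-encodes them into the query string. The property values are the ones from the question.

```python
import json
from urllib.parse import quote


def build_corenlp_url(base_url, properties):
    """Return base_url with the properties dict JSON-encoded in the query string.

    Hypothetical helper for illustration only.
    """
    props_json = json.dumps(properties)
    # Percent-encode the JSON so quotes, braces and commas survive in the URL.
    return base_url + "?properties=" + quote(props_json)


properties = {
    "annotators": "tokenize,ssplit,pos,ner,regexner",
    "pos.model": "edu/stanford/nlp/models/pos-tagger/french/french.tagger",
    "tokenize.language": "fr",
    "ner.model": "ner-model.ser.gz",      # custom NER model with STREET labels
    "regexner.mapping": "rules.tsv",      # SURFACE label
    "outputFormat": "json",
}

url = build_corenlp_url("http://127.0.0.1:9000/", properties)
print(url)
```

Building the properties as a dict and serializing it avoids stray quotes or trailing commas in the hand-written JSON.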

The current output (extract):

{u'index': 4, u'word': u'dans', u'lemma': u'dans', u'pos': u'P', u'characterOffsetEnd': 12, u'characterOffsetBegin': 8, u'originalText': u'dans', u'ner': u'O'}
{u'index': 5, u'word': u'la', u'lemma': u'la', u'pos': u'DET', u'characterOffsetEnd': 15, u'characterOffsetBegin': 13, u'originalText': u'la', u'ner': u'O'}
{u'index': 6, u'word': u'rue', u'lemma': u'rue', u'pos': u'NC', u'characterOffsetEnd': 19, u'characterOffsetBegin': 16, u'originalText': u'rue', u'ner': u'STREET'}
{u'index': 7, u'word': u'du', u'lemma': u'du', u'pos': u'P', u'characterOffsetEnd': 22, u'characterOffsetBegin': 20, u'originalText': u'du', u'ner': u'STREET'}
[...]
{u'index': 43, u'word': u'165', u'lemma': u'165', u'normalizedNER': u'165.0', u'pos': u'DET', u'characterOffsetEnd': 196, u'characterOffsetBegin': 193, u'originalText': u'165', u'ner': u'NUMBER'}
{u'index': 44, u'word': u'm', u'lemma': u'm', u'pos': u'NC', u'characterOffsetEnd': 198, u'characterOffsetBegin': 197, u'originalText': u'm', u'ner': u'O'}
{u'index': 45, u'word': u'2', u'lemma': u'2', u'normalizedNER': u'2.0', u'pos': u'ADJ', u'characterOffsetEnd': 199, u'characterOffsetBegin': 198, u'originalText': u'2', u'ner': u'NUMBER'}

Expected output: I would like the last three items to be labelled SURFACE, i.e. the RegexNER result.

Let me know if more details are needed.

Emre Colak · Accepted Answer

Here's what the RegexNER documentation says about this:

RegexNER will not overwrite an existing entity assignment, unless you give it permission in a third tab-separated column, which contains a comma-separated list of entity types that can be overwritten. Only the non-entity O label can always be overwritten, but you can specify extra entity tags which can always be overwritten as well.

Bachelor of (Arts|Laws|Science|Engineering|Divinity)	DEGREE
Lalor	LOCATION	PERSON
Labor	ORGANIZATION

I'm not sure what your mapping file exactly looks like, but if it just maps entities to labels, then the original NER will label your entities as NUMBER, and RegexNER won't be able to overwrite them. If you explicitly declare that some NUMBER entities should be overwritten as SURFACE in your mapping file, then it should work.
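As an illustration, here is a small Python sketch that formats one such mapping line with the third, overwrite column filled in. The helper `make_regexner_rule` is hypothetical, and the token pattern `[0-9]+ m 2` is only an assumption based on the tokens in the question's output (`165`, `m`, `2`); the actual rule in rules.tsv may look different.

```python
def make_regexner_rule(token_patterns, label, overwritable=()):
    """Format one RegexNER mapping line (hypothetical helper).

    Columns are tab-separated: space-separated token regexes, the label to
    assign, and optionally a comma-separated list of existing entity types
    that this rule is allowed to overwrite.
    """
    columns = [" ".join(token_patterns), label]
    if overwritable:
        columns.append(",".join(overwritable))
    return "\t".join(columns)


# Assumed pattern: a number, then "m", then "2", as tokenized in the question.
rule = make_regexner_rule(["[0-9]+", "m", "2"], "SURFACE", overwritable=["NUMBER"])
print(rule)
```

The resulting line would go into the file referenced by regexner.mapping; without the NUMBER entry in the third column, tokens already tagged NUMBER by the NER step would keep that label.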

stellasia · Answer

OK, things seem to work as I want if I put regexner first:

"annotators":"regexner,tokenize,ssplit,pos,ner",

It seems there is an ordering problem at some stage of the process?

Tags:

stanford-nlp

stanford-nlp-server