I am using the StanfordCoreNLPServer to extract some information from text (such as surfaces and street names).
The street is given by a specifically trained NER model, and the surface by a simple regex via RegexNER.
Each of them works fine separately, but when they are used together, only the NER result is present in the output, under the ner tag. Why isn't there a regexner tag? Is there a way to also have the RegexNER result?
For information:
StanfordCoreNLP v3.6.0
the URL used:
'http://127.0.0.1:9000/'
'?properties={"annotators":"tokenize,ssplit,pos,ner,regexner", '
'"pos.model":"edu/stanford/nlp/models/pos-tagger/french/french.tagger",'
'"tokenize.language":"fr",'
'"ner.model":"ner-model.ser.gz", ' # custom NER model with STREET labels
'"regexner.mapping":"rules.tsv", ' # SURFACE label
'"outputFormat": "json"}'
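For reference, the same request URL can be assembled without hand-escaping the JSON. This is only a sketch: build_url is a hypothetical helper name, the property values are copied from the snippet above, and only the standard library is used.

```python
import json
from urllib.parse import urlencode

# Properties copied from the question; file names are the asker's own.
PROPERTIES = {
    "annotators": "tokenize,ssplit,pos,ner,regexner",
    "pos.model": "edu/stanford/nlp/models/pos-tagger/french/french.tagger",
    "tokenize.language": "fr",
    "ner.model": "ner-model.ser.gz",   # custom NER model with STREET labels
    "regexner.mapping": "rules.tsv",   # SURFACE label
    "outputFormat": "json",
}


def build_url(base="http://127.0.0.1:9000/", properties=PROPERTIES):
    """Return the server URL with the properties JSON percent-encoded."""
    return base + "?" + urlencode({"properties": json.dumps(properties)})


url = build_url()
```

The text to annotate would then be sent as the body of a POST request to this URL.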
As suggested here, the regexner annotator comes after ner, but the problem remains.
The current output (extract):
{u'index': 4, u'word': u'dans', u'lemma': u'dans', u'pos': u'P', u'characterOffsetEnd': 12, u'characterOffsetBegin': 8, u'originalText': u'dans', u'ner': u'O'}
{u'index': 5, u'word': u'la', u'lemma': u'la', u'pos': u'DET', u'characterOffsetEnd': 15, u'characterOffsetBegin': 13, u'originalText': u'la', u'ner': u'O'}
{u'index': 6, u'word': u'rue', u'lemma': u'rue', u'pos': u'NC', u'characterOffsetEnd': 19, u'characterOffsetBegin': 16, u'originalText': u'rue', u'ner': u'STREET'}
{u'index': 7, u'word': u'du', u'lemma': u'du', u'pos': u'P', u'characterOffsetEnd': 22, u'characterOffsetBegin': 20, u'originalText': u'du', u'ner': u'STREET'}
[...]
{u'index': 43, u'word': u'165', u'lemma': u'165', u'normalizedNER': u'165.0', u'pos': u'DET', u'characterOffsetEnd': 196, u'characterOffsetBegin': 193, u'originalText': u'165', u'ner': u'NUMBER'}
{u'index': 44, u'word': u'm', u'lemma': u'm', u'pos': u'NC', u'characterOffsetEnd': 198, u'characterOffsetBegin': 197, u'originalText': u'm', u'ner': u'O'}
{u'index': 45, u'word': u'2', u'lemma': u'2', u'normalizedNER': u'2.0', u'pos': u'ADJ', u'characterOffsetEnd': 199, u'characterOffsetBegin': 198, u'originalText': u'2', u'ner': u'NUMBER'}
Expected output: I would like the last 3 items to be labelled SURFACE, i.e. the RegexNER result.
Let me know if more details are needed.
Here's what the RegexNER documentation says about this:
RegexNER will not overwrite an existing entity assignment, unless you give it permission in a third tab-separated column, which contains a comma-separated list of entity types that can be overwritten. Only the non-entity O label can always be overwritten, but you can specify extra entity tags which can always be overwritten as well.
Bachelor of (Arts|Laws|Science|Engineering|Divinity)	DEGREE
Lalor	LOCATION	PERSON
Labor	ORGANIZATION
I'm not sure what your mapping file exactly looks like, but if it just maps entities to labels, then the original NER will label your entities as NUMBER, and RegexNER won't be able to overwrite them. If you explicitly declare that some NUMBER entities should be overwritten as SURFACE in your mapping file, then it should work.
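As an illustration, here is a minimal sketch of what such a rule could look like, built from Python so that the tab separators are explicit. The pattern `[0-9]+ m 2` is only a guess based on the three tokens shown in the question (165, m, 2), and mapping_line and write_mapping are hypothetical helper names. Each whitespace-separated piece of the first column is a regex matched against one token.

```python
def mapping_line(pattern, label, overwritable=()):
    """Build one RegexNER mapping row: pattern, label, optional overwritable types."""
    columns = [pattern, label]
    if overwritable:
        # Third column: comma-separated entity types RegexNER may overwrite.
        columns.append(",".join(overwritable))
    return "\t".join(columns)


# Assumed rule: a number, then "m", then "2", labelled SURFACE.
# NUMBER is listed so the labels assigned by the ner annotator can be replaced.
SURFACE_RULE = mapping_line("[0-9]+ m 2", "SURFACE", overwritable=["NUMBER"])


def write_mapping(path, lines):
    """Write the mapping rows to a file, one rule per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

Calling write_mapping("rules.tsv", [SURFACE_RULE]) would produce a one-line mapping file; whether this exact pattern matches depends on how the tokenizer splits the surface expression.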
OK, things seem to work as I want if I put regexner first:
"annotators":"regexner,tokenize,ssplit,pos,ner",
It seems there is an ordering problem at some stage of the process?