I have the training data for a new NER type from the "Training an additional entity type" section of the spaCy documentation:
    TRAIN_DATA = [
        ("Horses are too tall and they pretend to care about your feelings", {
            'entities': [(0, 6, 'ANIMAL')]
        }),
        ("Do they bite?", {
            'entities': []
        }),
        ("horses are too tall and they pretend to care about your feelings", {
            'entities': [(0, 6, 'ANIMAL')]
        }),
        ("horses pretend to care about your feelings", {
            'entities': [(0, 6, 'ANIMAL')]
        }),
        ("they pretend to care about your feelings, those horses", {
            'entities': [(48, 54, 'ANIMAL')]
        }),
        ("horses?", {
            'entities': [(0, 6, 'ANIMAL')]
        })
    ]
I want to train an NER model on this data using the spacy command line application. This requires data in spaCy's JSON format. How do I write the above data (i.e. text with labeled character offset spans) in this JSON format?
After looking at the documentation for that format, it's not clear to me how to write data in this format by hand. (For example, do I have to partition everything into paragraphs?) There is also a convert command line utility that converts from non-spaCy data formats to spaCy's format, but it doesn't accept a spaCy format like the one above as input.
I understand the examples of NER training code that uses the "Simple training style", but I'd like to be able to use the command line utility for training. (Though as is apparent from my previous spaCy question, I'm unclear when you're supposed to use that style and when you're supposed to use the command line.)
Can someone show me an example of the above data in "spaCy's JSON format", or point to documentation that explains how to make this transformation?
There's a function built into spaCy that will get you most of the way there:
from spacy.gold import biluo_tags_from_offsets
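That takes the "offset" style annotations you have there and converts them to the token-by-token BILUO format (also written BILOU), where each token gets one tag: B-, I-, L- for the beginning, inside and last token of a multi-token entity, U- for a single-token entity, and O for tokens outside any entity.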
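To make the tagging scheme concrete, here is a minimal pure-Python sketch of the same idea. It is not spaCy's implementation: the function name `biluo_from_offsets` and the regex tokenizer are assumptions made for illustration, and spans that don't line up with token boundaries are simply skipped here.

```python
import re

def biluo_from_offsets(text, entities):
    """Illustrative sketch only, not spaCy's code: offsets -> BILUO tags."""
    # Hypothetical tokenizer: runs of word characters, or single punctuation marks.
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        # Indices of tokens lying entirely inside the character span.
        inside = [i for i, (_, s, e) in enumerate(tokens)
                  if s >= start and e <= end]
        # Skip spans that do not align with token boundaries.
        if (not inside or tokens[inside[0]][1] != start
                or tokens[inside[-1]][2] != end):
            continue
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label      # single-token entity
        else:
            tags[inside[0]] = "B-" + label      # first token
            for i in inside[1:-1]:
                tags[i] = "I-" + label          # middle tokens
            tags[inside[-1]] = "L-" + label     # last token
    return [tok for tok, _, _ in tokens], tags

words, tags = biluo_from_offsets("horses pretend to care", [(0, 6, "ANIMAL")])
multi_words, multi_tags = biluo_from_offsets("New York is big", [(0, 8, "GPE")])
```

In the first call the single token "horses" is tagged `U-ANIMAL`; in the second, the two-token span is tagged `B-GPE`, `L-GPE`.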
To put the NER annotations into the final training JSON format, you just need a bit more wrapping around them to fill out the other slots the data requires:
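    import spacy

    # Assumption: a blank English pipeline; only the tokenizer is needed here.
    nlp = spacy.blank("en")

    sentences = []
    for text, annotations in TRAIN_DATA:
        doc = nlp(text)
        # One BILUO tag per token, e.g. "U-ANIMAL" or "O"
        tags = biluo_tags_from_offsets(doc, annotations['entities'])
        tokens = []
        for n, (token, tag) in enumerate(zip(doc, tags)):
            tokens.append({
                "head": 0,   # placeholder: no dependency annotation
                "dep": "",
                "tag": "",
                "orth": token.text,   # token text without trailing whitespace
                "ner": tag,
                "id": n,
            })
        sentences.append(tokens)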
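The token lists still need to be nested inside the document/paragraph/sentence layout before they can be written out. The sketch below shows one way to do that, based on my reading of the JSON training format in the spaCy v2 documentation; check the exact keys against the version you have installed. The helper name `to_spacy_json` is hypothetical, and treating each training example as its own document with one paragraph and one sentence is an assumption, not a requirement of the format.

```python
import json

def to_spacy_json(sentences, raw_texts):
    """Wrap per-sentence token lists in a document/paragraph/sentence layout."""
    docs = []
    for doc_id, (tokens, raw) in enumerate(zip(sentences, raw_texts)):
        docs.append({
            "id": doc_id,
            "paragraphs": [{
                "raw": raw,  # original text of the example
                "sentences": [{"tokens": tokens, "brackets": []}],
            }],
        })
    return docs

# Stand-in for one entry of `sentences` as built by the loop above.
example_tokens = [
    {"head": 0, "dep": "", "tag": "", "orth": "horses", "ner": "U-ANIMAL", "id": 0},
    {"head": 0, "dep": "", "tag": "", "orth": "?", "ner": "O", "id": 1},
]
data = to_spacy_json([example_tokens], ["horses?"])
serialized = json.dumps(data, indent=2)

# To write a training file (the file name is an assumption):
# with open("train.json", "w", encoding="utf8") as f:
#     f.write(serialized)
```

With the real data you would call `to_spacy_json(sentences, [text for text, _ in TRAIN_DATA])`.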
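Make sure that you disable the non-NER pipeline components (such as the tagger and parser) before training with this data, since the head, dep and tag fields above are only placeholders.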
I've run into some issues using spacy train on NER-only data. See #1907 and also check out this discussion on the Prodigy forum for some possible workarounds.