 

How do I convert simple training style data to spaCy's command line JSON format?

Tags:

spacy

I have the training data for a new NER type from the "Training an additional entity type" section of the spaCy documentation.

TRAIN_DATA = [
    ("Horses are too tall and they pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),

    ("Do they bite?", {
        'entities': []
    }),

    ("horses are too tall and they pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),

    ("horses pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),

    ("they pretend to care about your feelings, those horses", {
        'entities': [(48, 54, 'ANIMAL')]
    }),

    ("horses?", {
        'entities': [(0, 6, 'ANIMAL')]
    })
]

I want to train an NER model on this data using the spacy command line application. This requires data in spaCy's JSON format. How do I write the above data (i.e. text with labeled character offset spans) in this JSON format?

After looking at the documentation for that format, it's not clear to me how to write data in this format by hand. (For example, do I have to partition everything into paragraphs?) There is also a convert command line utility that converts from non-spaCy data formats to spaCy's format, but it doesn't accept the simple training format above as input.

I understand the examples of NER training code that use the "Simple training style", but I'd like to be able to use the command line utility for training. (Though, as is apparent from my previous spaCy question, I'm unclear on when you're supposed to use that style and when you're supposed to use the command line.)

Can someone show me an example of the above data in "spaCy's JSON format", or point me to documentation that explains how to make this transformation?

asked Feb 21 '18 by W.P. McNeill


1 Answer

There's a built-in function in spaCy that will get you most of the way there:

from spacy.gold import biluo_tags_from_offsets

It takes the character-offset annotations you have there and converts them to token-by-token BILUO tags.
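
For example, here's a quick check on one sentence from the question (a minimal sketch against spaCy 2.x; the en_core_web_sm model is just an assumption, any loaded pipeline with a tokenizer will do):

import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.load('en_core_web_sm')
doc = nlp("they pretend to care about your feelings, those horses")
print(biluo_tags_from_offsets(doc, [(48, 54, 'ANIMAL')]))
# Tokens outside the span get 'O'; the single-token entity "horses" comes
# back as 'U-ANIMAL' (U marks a unit-length entity in the BILUO scheme).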

To put the NER annotations into the final training JSON format, you just need a bit more wrapping around them to fill out the other slots the data requires:

import spacy

nlp = spacy.load('en_core_web_sm')  # assumption: any model with a tokenizer will do

sentences = []
for text, annotations in TRAIN_DATA:
    doc = nlp(text)
    # Convert the character-offset entity spans to per-token BILUO tags
    tags = biluo_tags_from_offsets(doc, annotations['entities'])
    tokens = []
    for i, (token, tag) in enumerate(zip(doc, tags)):
        tokens.append({
            "id": i,
            "orth": token.text,
            "ner": tag,
            # the format also expects dependency/POS slots; they can stay empty
            "head": 0,
            "dep": "",
            "tag": "",
        })
    sentences.append(tokens)

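To go from that sentences list to a file that spacy train will accept, you still need the outer document structure. The sketch below follows the layout described in spaCy 2.x's training-data documentation as I understand it (a list of documents, each with "id" and "paragraphs"; each paragraph with "raw" and "sentences"; each sentence holding the token dicts built above). The file name train.json is arbitrary, and it's worth double-checking the field names against the docs for your spaCy version:

import json

train_json = [{
    "id": 0,
    "paragraphs": [
        {
            "raw": text,  # the original sentence text
            "sentences": [{"tokens": toks, "brackets": []}],
        }
        for (text, _), toks in zip(TRAIN_DATA, sentences)
    ],
}]

with open("train.json", "w") as f:
    json.dump(train_json, f, indent=2)

You'll want a dev file in the same format as well; both can then be passed to python -m spacy train.
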
Make sure that you disable the non-NER pipeline components before training with this data; I've run into some issues using spacy train on NER-only data. See spaCy issue #1907 and this discussion on the Prodigy support forum for some possible workarounds.

answered Nov 11 '22 by ahalt