
Formatting training dataset for SpaCy NER

I want to train a blank model for NER with my own entities. To do this, I need to use a dataset, which is currently in .csv form and features entity tags in the following format (I'll provide one example row for each relevant column):


Column: sentence

Value: I want apples


Column: data

Value: ['want;@command;2;6','apples;@fruit;7;13']


Column: entity

Value: I @command @fruit


Column: entity_types

Value: @bot/@command;@bot/@food/@fruit


In order to train SpaCy's NER, I need the training data as json in the following form:

    TRAIN_DATA = [
        ('Who is Shaka Khan?', {
            'entities': [(7, 17, 'PERSON')]
        }),
        ('I like London and Berlin.', {
            'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
        })
    ]

Link to the relevant part in the SpaCy Docs

I've tried to find a way to re-format the data from the CSV into the format required by spaCy, but I have been unsuccessful so far. The dataset contains all the necessary information (text string, entity names, entity types, entity offsets), but I simply don't know how to get it into the correct form.

I would appreciate any and all help concerning how I would accomplish this!

Dionysos asked Nov 22 '17 21:11



1 Answer

It wasn't 100% clear from your question whether you're also asking about the CSV extraction – so I'll just assume this is not the problem. (If it is, this should be pretty easy to achieve using the csv module. If the CSV data is messy and contains a bunch of stuff combined in one string, you might have to call split on it and do it the hacky way.)

If you're able to extract the "sentence" and "data" column in a format like this, you're actually very close to spaCy's training format already:

[{
    'sentence': 'I want apples',
    'data': [('want', '@command', 2, 6), ('apples', '@fruit', 7, 13)]
}]
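If you do need the CSV extraction step, here is a minimal sketch of one way to get to that structure with the standard library. It assumes each "data" cell is a well-formed Python-style list literal of `text;label;start;end` strings; the sample CSV contents and the helper names `parse_data_cell` and `extract_rows` are made up for illustration and are not part of spaCy.

```python
import ast
import csv
import io

# Hypothetical CSV contents; in practice you would open your own file instead.
SAMPLE_CSV = '''sentence,data
I want apples,"['want;@command;2;6','apples;@fruit;7;13']"
'''


def parse_data_cell(cell):
    """Turn "['want;@command;2;6', ...]" into [('want', '@command', 2, 6), ...]."""
    entities = []
    # Assumption: the cell is a valid list literal, so literal_eval can read it.
    for item in ast.literal_eval(cell):
        # Split from the right so a ';' inside the entity text would not break parsing.
        text, label, start, end = item.rsplit(';', 3)
        entities.append((text, label, int(start), int(end)))
    return entities


def extract_rows(csv_file):
    """Read the 'sentence' and 'data' columns from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return [
        {'sentence': row['sentence'], 'data': parse_data_cell(row['data'])}
        for row in reader
    ]


your_extracted_data = extract_rows(io.StringIO(SAMPLE_CSV))
print(your_extracted_data)
```

With a real file you would pass `open('your_file.csv', newline='')` (a placeholder name) instead of the `io.StringIO` object. If some cells are not valid list literals, `ast.literal_eval` will raise an error, and those rows would need the hackier manual `split` approach.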

Your offsets already follow the same convention as spaCy: the start index is inclusive and the end index is exclusive, just like a Python slice, so 'I want apples'[2:6] gives 'want' and 'I want apples'[7:13] gives 'apples'. That means you can pass the offsets through unchanged. I'm probably making this a lot more verbose than it should be, but I hope this makes it easier to follow:

TRAIN_DATA = []

for example in your_extracted_data:  # see example above
    entities = []
    for entity in example['data']:  # iterate over the entities
        text, label, start, end = entity  # ('want', '@command', 2, 6)
        label = label.split('@')[1].upper()  # not necessary, but nicer
        # start is inclusive and end is exclusive, same as spaCy, so no adjustment
        entities.append((start, end, label))
    # add training example of (text, annotations) tuple
    TRAIN_DATA.append((example['sentence'], {'entities': entities}))

This should give you training data that looks like this:

[
    ('I want apples', {'entities': [(2, 6, 'COMMAND'), (7, 13, 'FRUIT')]})
]
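
Before training, it can help to confirm that every offset pair really points at the entity text. spaCy's character offsets behave like Python slices (start inclusive, end exclusive), so slicing the sentence is a cheap check. The sketch below is only an illustration; `check_offsets` is a hypothetical helper, not a spaCy function.

```python
def check_offsets(sentence, entities):
    """Return the entities whose (start, end) slice does not equal the entity text."""
    mismatches = []
    for text, label, start, end in entities:
        found = sentence[start:end]  # end is exclusive, as in spaCy
        if found != text:
            mismatches.append((text, label, start, end, found))
    return mismatches


sentence = 'I want apples'
data = [('want', '@command', 2, 6), ('apples', '@fruit', 7, 13)]

# No mismatches: each slice equals its entity text.
print(check_offsets(sentence, data))

# A deliberately shifted copy shows what a wrong end offset looks like.
shifted = [(text, label, start, end - 1) for text, label, start, end in data]
print(check_offsets(sentence, shifted))
```

The first call prints an empty list; the second reports that the shortened spans cover only 'wan' and 'apple', which is the kind of misalignment you want to catch before handing the data to spaCy.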
Ines Montani answered Sep 22 '22 12:09
