I want to train a blank model for NER with my own entities. To do this, I need to use a dataset, which is currently in .csv form and features entity tags in the following format (I'll provide one example row for each relevant column):
Column: sentence
Value: I want apples
Column: data
Value: ['want;@command;2;6', 'apples;@fruit;7;13']
Column: entity
Value: I @command @fruit
Column: entity_types
Value: @bot/@command;@bot/@food/@fruit
In order to train SpaCy's NER, I need the training data as json in the following form:
TRAIN_DATA = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]
Link to the relevant part in the SpaCy Docs
I've tried to find a solution for re-formatting the data from the CSV into the format required by SpaCy, but I have been unsuccessful so far. The dataset does contain all the necessary information - text string, entity names, entity types, entity offsets - but I simply don't know how to get it into the correct form.
I would appreciate any and all help concerning how I would accomplish this!
It wasn't 100% clear from your question whether you're also asking about the CSV extraction – so I'll just assume this is not the problem. (If it is, this should be pretty easy to achieve using the csv module. If the CSV data is messy and contains a bunch of stuff combined in one string, you might have to call split on it and do it the hacky way.)
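Just to illustrate, a rough sketch of that extraction could look something like this. The file name and the exact parsing of the 'data' cell are assumptions based on the example row you posted, so you'll likely have to adapt it:

import csv

your_extracted_data = []
with open('training_data.csv', newline='', encoding='utf-8') as csvfile:  # file name is a placeholder
    for row in csv.DictReader(csvfile):
        entities = []
        # row['data'] holds something like "['want;@command;2;6','apples;@fruit;7;13']"
        for item in row['data'].strip('[]').split(','):
            text, label, start, end = item.strip(' \'"').split(';')
            entities.append((text, label, int(start), int(end)))
        your_extracted_data.append({'sentence': row['sentence'], 'data': entities})

This should give you a list of dicts in the shape used below.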
If you're able to extract the "sentence" and "data" columns in a format like this, you're actually very close to spaCy's training format already:
[{
    'sentence': 'I want apples',
    'data': [('want', '@command', 2, 6), ('apples', '@fruit', 7, 13)]
}]
Your data already counts the end character the same way spaCy does: the end index is exclusive, i.e. it points one past the last character of the span ('want' in 'I want apples' is (2, 6), 'apples' is (7, 13)), so the offsets can be used as they are. I'm probably making this a lot more verbose than it should be, but I hope this makes it easier to follow:
TRAIN_DATA = []

for example in your_extracted_data:  # see example above
    entities = []
    for entity in example['data']:  # iterate over the entities
        text, label, start, end = entity  # ('want', '@command', 2, 6)
        label = label.split('@')[1].upper()  # not necessary, but nicer
        entities.append((start, end, label))
    # add training example as a (text, annotations) tuple
    TRAIN_DATA.append((example['sentence'], {'entities': entities}))
This should give you training data that looks like this:
[
    ('I want apples', {'entities': [(2, 6, 'COMMAND'), (7, 13, 'FRUIT')]})
]
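Since you mentioned training a blank model: once TRAIN_DATA is in this shape, the training itself can follow the NER training example in the docs. A minimal sketch for spaCy v2 could look roughly like this (the language 'en', the number of iterations and the dropout value are just placeholders):

import random
import spacy

nlp = spacy.blank('en')            # blank model; pick whatever language you need
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

for _, annotations in TRAIN_DATA:  # register your custom labels
    for start, end, label in annotations['entities']:
        ner.add_label(label)

optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, drop=0.35, losses=losses)
    print(losses)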