Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert NER SpaCy format to IOB format

I have data which is already labelled in SpaCy format. For example:

("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]})

But I want to try training it with any other NER model, such as BERT-NER, which requires IOB tagging instead. Is there any conversion code from SpaCy data format to IOB?

Thanks!

like image 309
eng2019 Avatar asked Jan 25 '23 10:01

eng2019


1 Answers

This is closely related to and mostly copied from https://stackoverflow.com/a/59209377/461847, see the notes in the comments there, too:

import spacy
from spacy.gold import biluo_tags_from_offsets

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    # then convert L->I and U->B to have IOB tags for the tokens in the doc
like image 172
aab Avatar answered Jan 30 '23 21:01

aab