I am building a Named Entity Recognition (NER) model for biomedical text (cancer papers from PubMed). I trained a custom NER model using spaCy for three entity types (DISEASE, GENE, and DRUG). I then combined the model with rule-based components to improve its accuracy.
Here is my current code -
import spacy
from spacy.pipeline import EntityRuler

# Load the trained NER model
nlp = spacy.load("my_spacy_model")
# Entity patterns for the EntityRuler (only 2 relevant patterns shown here; the full list contains more)
patterns = [{"label": "GENE", "pattern": "BRCA1"},
            {"label": "GENE", "pattern": "BRCA2"}]
ruler = EntityRuler(nlp)
ruler.add_patterns(patterns)
# No position argument is given, so the ruler is appended after the trained "ner" component
nlp.add_pipe(ruler)
When I test the above code on the following piece of text -
text = "Exceptional response to olaparib in BRCA2-altered breast cancer after PD-L1 inhibitor and chemotherapy failure"
I get the following result -
DISEASE BRCA2-altered breast cancer
DRUG olaparib
GENE PD-L1
However, the correct answer is -
GENE BRCA2
^^^^^^^^^^^
DISEASE breast cancer
^^^^^^^^^^^^^^^^^^^^^
DRUG olaparib
GENE PD-L1
The pipeline is not recognizing BRCA2 as a gene, even though I added it to the patterns for the EntityRuler.
Is there a way to prioritize predictions from rule-based matching over the trained model? Alternatively, is there something else I can do to get the correct results by combining rule-based matching?
You can either add the EntityRuler before the NER component in the pipeline:
nlp.add_pipe(ruler, before="ner")
Or tell the EntityRuler to overwrite existing entities:
ruler = EntityRuler(nlp, overwrite_ents=True)
The NER predictions may differ slightly between the two options: in the first, the ruler's entities are already set when the NER component runs, so the model's predictions can change around those existing entity spans.
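To make the second option more concrete, here is a minimal plain-Python sketch of what "rules win on overlap" means for the example sentence. It is an illustration only, not spaCy's implementation: the helpers `overlaps`, `merge_spans`, and `span` are hypothetical names, spans are represented as (start_char, end_char, label) tuples, and the model spans are copied by hand from the output shown in the question.

```python
# Illustrative only: a plain-Python sketch of "rule spans win on overlap".
# This is NOT spaCy's code; all helper names here are hypothetical.

text = ("Exceptional response to olaparib in BRCA2-altered breast cancer "
        "after PD-L1 inhibitor and chemotherapy failure")


def span(phrase, label):
    """Build a (start_char, end_char, label) tuple for the first match of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def overlaps(a, b):
    """True if two half-open character spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_spans(model_spans, rule_spans):
    """Keep every rule span; keep a model span only if it overlaps no rule span."""
    kept = [m for m in model_spans
            if not any(overlaps(m, r) for r in rule_spans)]
    return sorted(kept + list(rule_spans))


# Model output as reported in the question (entered by hand).
model_spans = [
    span("olaparib", "DRUG"),
    span("BRCA2-altered breast cancer", "DISEASE"),
    span("PD-L1", "GENE"),
]
# The single rule-based match relevant to this sentence.
rule_spans = [span("BRCA2", "GENE")]

merged = merge_spans(model_spans, rule_spans)
for start, end, label in merged:
    print(label, text[start:end])
```

In this sketch the overlapping DISEASE span is dropped as a whole rather than shortened to "breast cancer". As far as I understand, overwrite_ents=True behaves similarly, so recovering "breast cancer" would likely need either a pattern of its own or the first option, where the NER component predicts around the ruler's entities.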