Case-sensitive entity recognition

Tags: python, named-entity-recognition, spacy

I have keywords that are all stored in lower case, e.g. "discount nike shoes", that I am trying to perform entity extraction on. The issue I've run into is that spaCy seems to be case-sensitive when it comes to NER. Mind you, I don't think that this is spaCy-specific.

When I run...

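import spacy

# Assumption: any pre-trained English pipeline; the question does not say which one was used.
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"i love nike shoes from the uk")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)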

... nothing is returned.

When I run...

doc = nlp(u"i love Nike shoes from the Uk")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

I get the following results...

Nike 7 11 ORG
Uk 25 27 GPE

Should I just title case everything? Is there another workaround that I could use?

asked May 30 '19 19:05

Emma Jean

1 Answer

spaCy's pre-trained statistical models were trained on a large corpus of general news and web text. This means that the entity recognizer has likely seen very few all-lowercase examples, because lowercase-only text is much less common in those types of texts. In English, capitalisation is also a strong indicator of a named entity (unlike German, where all nouns are typically capitalised), so the model probably tends to pay more attention to it.

If you're working with text that doesn't have proper capitalisation, you probably want to fine-tune the model to be less sensitive here. See the docs on updating the named entity recognizer for more details and code examples.
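A minimal sketch of what such an update loop could look like, assuming spaCy v3 (`Example` and `nlp.initialize` are v3 APIs; v2 used a different interface). To stay runnable without a model download, it trains a blank English pipeline on two made-up sentences rather than fine-tuning a pre-trained one. For a pre-trained pipeline you would load it with `spacy.load` and, as far as I know, obtain the optimizer from `nlp.resume_training()` instead of `nlp.initialize`. The sentences, labels and iteration count are purely illustrative.

```python
import random

import spacy
from spacy.training import Example

# Illustrative data only: a capitalised sentence and its lowercase twin,
# both annotated with character offsets (start, end, label).
TRAIN_DATA = [
    ("I love Nike shoes from the UK", {"entities": [(7, 11, "ORG"), (27, 29, "GPE")]}),
    ("i love nike shoes from the uk", {"entities": [(7, 11, "ORG"), (27, 29, "GPE")]}),
]

# Assumption: a blank pipeline, so the sketch runs without a downloaded model.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

# Pair each tokenized text with its gold-standard annotations.
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]
optimizer = nlp.initialize(lambda: examples)

for _ in range(10):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)
```

With real data you would use far more examples, batch them, and evaluate on held-out text before trusting the result.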

Producing the training examples will hopefully not be very difficult, because you can use existing annotations and datasets, or create a dataset using the pre-trained model, and then lowercase everything. For example, you could take text with proper capitalisation, run the model over it and extract all entity spans in the text. Next, you lowercase all the texts and update the model with the new data. Make sure to also mix in text with proper capitalisation, because you don't want the model to learn something like "Everything is lowercase now! Capitalisation doesn't exist anymore!".
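The data preparation described above can be sketched in plain Python. The helper name `add_lowercase_copies` is hypothetical, and the annotated sentence is hard-coded here; in practice the spans could come from running the pre-trained model over properly capitalised text and collecting `(ent.start_char, ent.end_char, ent.label_)` for each entity. The `(text, {"entities": [...]})` layout is an assumption that follows spaCy's character-offset convention.

```python
import random


def add_lowercase_copies(examples, seed=0):
    """Return the original examples plus a lowercased copy of each one."""
    mixed = []
    for text, annotations in examples:
        # Keep the properly capitalised original in the mix.
        mixed.append((text, annotations))
        lowered = text.lower()
        # str.lower() can change the length of some non-ASCII strings,
        # which would invalidate the character offsets, so skip those.
        if len(lowered) == len(text) and lowered != text:
            mixed.append((lowered, {"entities": list(annotations["entities"])}))
    # Shuffle so cased and lowercased examples are interleaved.
    random.Random(seed).shuffle(mixed)
    return mixed


annotated = [
    ("I love Nike shoes from the UK", {"entities": [(7, 11, "ORG"), (27, 29, "GPE")]}),
]
training_data = add_lowercase_copies(annotated)
```

Because lowercasing leaves the character positions unchanged for this kind of text, the original entity offsets can be reused as-is for the lowercased copy.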

Btw, if you have entities that can be defined using a list or set of rules, you might also want to check out the EntityRuler component. It can be combined with the statistical entity recognizer and will let you pass in a dictionary of exact matches or abstract token patterns that can be case-insensitive. For instance, [{"LOWER": "nike"}] would match one token whose lowercase form is "nike", so "NIKE", "Nike", "nike", "NiKe" etc.
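A small sketch of the rule-based route, assuming spaCy v3, where the component is added by its string name `"entity_ruler"` (v2 constructed an `EntityRuler` object instead). It uses a blank English pipeline so no model download is needed, and spells the attribute key as uppercase `LOWER`, the way spaCy's docs write it. The two patterns and their labels are just illustrations based on the question's example.

```python
import spacy

# Assumption: a blank pipeline; with a pre-trained one you could instead call
# nlp.add_pipe("entity_ruler", before="ner") so the rules take precedence.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Each token pattern matches on the lowercase form, so casing is ignored.
ruler.add_patterns([
    {"label": "ORG", "pattern": [{"LOWER": "nike"}]},
    {"label": "GPE", "pattern": [{"LOWER": "uk"}]},
])

doc = nlp("i love nike shoes from the uk")
ents = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
print(ents)
```

This only covers entities you can enumerate up front; anything not in the pattern list still depends on the statistical model.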

answered Oct 11 '22 04:10

Ines Montani
