 

Train Spacy NER on Indian Names

I am trying to customize spaCy's NER to identify Indian names. I am following this guide: https://spacy.io/usage/training and this is the dataset I am using: https://gist.githubusercontent.com/mbejda/9b93c7545c9dd93060bd/raw/b582593330765df3ccaae6f641f8cddc16f1e879/Indian-Female-Names.csv

As per the code, I am supposed to provide training data in the following format:

TRAIN_DATA = [
    # Offsets are (start_char, end_char, label); end_char is exclusive
    ('Shivani', {
        'entities': [(0, 7, 'PERSON')]
    }),
    ('Isha', {
        'entities': [(0, 4, 'PERSON')]
    })
]

How do I provide training data to spaCy for ~12000 names, given that manually specifying each entity would be a chore? Is there a tool available to tag all the names?

shri_wahal asked Mar 26 '18 04:03



2 Answers

You are missing the point of training an NLP library on custom names. The training data has to be a list of entries, each containing a sentence together with the character offsets of the name(s) in it. Please review the training data example again: you need to supply a full sentence, not just a name, because the model learns from the context around the entity.

spaCy is not meant to be a gazetteer matching tool. You are likely better off generating about 100 sentences that use some of these names and then training spaCy on those annotated sentences. You can add more full-sentence examples as needed to increase accuracy. spaCy's native NER for names is robust and does not need 12000 examples.

@ak_35's answer below provides examples of how to provide training sentences with the location of names labeled.
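One way to avoid hand-annotating sentences is to fill a few sentence templates with names and compute the offsets in code. The sketch below is only an illustration: `TEMPLATES`, `make_example` and `build_train_data` are hypothetical names rather than part of spaCy, the templates reuse the example sentences from this thread, and the list of names is assumed to be already loaded from the CSV.

```python
import random

# Hypothetical sentence templates; "{name}" marks where the name goes.
TEMPLATES = [
    "{name} lives in Chennai",
    "Did you talk to {name} yesterday",
    "{name} bought a new phone",
]

def make_example(template, name, label="PERSON"):
    """Fill one template and return (text, {'entities': [(start, end, label)]})."""
    start = template.index("{name}")   # the name begins where the placeholder was
    end = start + len(name)            # end offset is exclusive
    text = template.replace("{name}", name, 1)
    return (text, {"entities": [(start, end, label)]})

def build_train_data(names, templates=TEMPLATES, seed=0):
    """Pair each non-empty name with a randomly chosen template."""
    rng = random.Random(seed)
    cleaned = [n.strip() for n in names if n.strip()]
    return [make_example(rng.choice(templates), n) for n in cleaned]

# Assumption: in practice `names` would come from the CSV file.
names = ["Shivani", "Isha "]
TRAIN_DATA = build_train_data(names)
for text, annotations in TRAIN_DATA:
    print(text, annotations)
```

A handful of templates produces repetitive contexts, so more varied sentences (ideally real ones from the target documents) are likely to generalize better.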

Adnan S answered Oct 03 '22 17:10


Your current format for TRAIN_DATA will not give you good results. spaCy needs full sentences with the character offsets of each entity, in the format shown below:

TRAIN_DATA = [
    # Offsets are (start_char, end_char, label); end_char is exclusive
    ('Shivani lives in chennai', {
        'entities': [(0, 7, 'PERSON')]
    }),
    ('Did you talk to Shivani yesterday', {
        'entities': [(16, 23, 'PERSON')]
    }),
    ('Isha bought a new phone', {
        'entities': [(0, 4, 'PERSON')]
    })
]
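Because the end offset is exclusive, slicing the text with `text[start:end]` should give back exactly the entity string. A minimal sanity check along those lines, where `check_offsets` is a hypothetical helper and not a spaCy function:

```python
def check_offsets(train_data):
    """Return (text, start, end, covered_text) for every annotated span."""
    report = []
    for text, annotations in train_data:
        for start, end, _label in annotations["entities"]:
            report.append((text, start, end, text[start:end]))
    return report

sample = [
    ("Shivani lives in chennai", {"entities": [(0, 7, "PERSON")]}),
    ("Did you talk to Shivani yesterday", {"entities": [(16, 23, "PERSON")]}),
    ("Isha bought a new phone", {"entities": [(0, 4, "PERSON")]}),
]

for text, start, end, covered in check_offsets(sample):
    # Each printed slice should be the full name, with nothing cut off.
    print(f"{start:>2}-{end:<2} {covered!r}")
```

An end offset that is one too small, such as (0, 6) for 'Shivani', would cover only 'Shivan', which is an easy mistake to spot with this check.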

See the spaCy training documentation for details. As for automating the annotation of 12000 entries, there are tools that help you annotate data quickly. One is Prodigy (from the same developers as spaCy), though it is a paid product; its online demo shows it in action. If you give up on NER, pattern matching might also work well if you only need to find names in a document. Done right, it would be faster and more accurate too.
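To give a rough idea of the pattern-matching route, here is a sketch that uses only Python's standard `re` module as a stand-in; spaCy ships its own rule-based matchers, which this does not use. `build_name_pattern` and `find_names` are hypothetical helper names, and the name list is assumed to be already loaded.

```python
import re

def build_name_pattern(names):
    """Compile one case-sensitive regex that matches any listed name as a whole word."""
    cleaned = sorted({n.strip() for n in names if n.strip()}, key=len, reverse=True)
    if not cleaned:
        raise ValueError("need at least one non-empty name")
    alternation = "|".join(re.escape(n) for n in cleaned)
    # \b keeps 'Isha' from matching inside a longer word such as 'Ishani'.
    return re.compile(rf"\b(?:{alternation})\b")

def find_names(text, pattern):
    """Return (start, end, matched_name) for every name found in the text."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

pattern = build_name_pattern(["Shivani", "Isha "])
print(find_names("Did you talk to Shivani and Isha yesterday?", pattern))
```

A list lookup like this finds only names it already knows and cannot tell a name from an identical ordinary word, which is the trade-off against a trained NER model.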

ak_35 answered Oct 03 '22 18:10