I am trying to customize spaCy's NER to identify Indian names. I am following this guide: https://spacy.io/usage/training and this is the dataset I am using: https://gist.githubusercontent.com/mbejda/9b93c7545c9dd93060bd/raw/b582593330765df3ccaae6f641f8cddc16f1e879/Indian-Female-Names.csv
According to the guide, I am supposed to provide training data in the following format:
# Entity offsets are (start_char, end_char, label); end_char is exclusive.
TRAIN_DATA = [
    ('Shivani', {
        'entities': [(0, 7, 'PERSON')]
    }),
    ('Isha ', {
        'entities': [(0, 4, 'PERSON')]
    })
]
How do I provide training data to spaCy for ~12000 names, given that manually specifying each entity would be a chore? Is there any other tool available to tag all the names?
The recommended way to train your spaCy pipelines is via the spacy train command on the command line. It only needs a single config.cfg configuration file that includes all settings and hyperparameters.
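Since spacy train reads binary .spacy files rather than Python tuples, the (text, annotations) format from the question has to be converted first. The following is a minimal sketch of that conversion, assuming spaCy v3; the helper name to_docbin and the file names in the comments are illustrative and not part of the original posts.

```python
import spacy
from spacy.tokens import DocBin

# Two illustrative examples; end offsets are exclusive.
TRAIN_DATA = [
    ("Shivani lives in Chennai", {"entities": [(0, 7, "PERSON")]}),
    ("Did you talk to Shivani yesterday", {"entities": [(16, 23, "PERSON")]}),
]


def to_docbin(train_data, nlp):
    """Convert (text, {"entities": [...]}) tuples into a DocBin (hypothetical helper)."""
    db = DocBin()
    for text, annotations in train_data:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations["entities"]:
            # char_span returns None when the offsets do not align with token boundaries.
            span = doc.char_span(start, end, label=label)
            if span is None:
                raise ValueError(f"Misaligned entity {(start, end)} in {text!r}")
            spans.append(span)
        doc.ents = spans
        db.add(doc)
    return db


nlp = spacy.blank("en")  # blank English pipeline: tokenizer only
db = to_docbin(TRAIN_DATA, nlp)

# To train, save the data and point the CLI at it (paths are assumptions):
#   db.to_disk("train.spacy")
#   python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy
```

Raising on a misaligned span is a deliberate choice here: an off-by-one offset would otherwise silently drop the entity from the training data.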
Re-train spaCy NER with your custom examples: If you have, for instance, a few hundred examples with real addresses, you can manually tag them and then re-train the spaCy NER to fit your particular address format. You can train a new NER from scratch or fine-tune an existing one.
In terms of NER, developers use a machine learning-based solution. During the first phase, the ML model is trained on the annotated documents. The amount of time it will take to train the model will depend on the complexity of the model. The next phase involves annotating raw documents using the trained model.
You are missing the point of training an NLP library for custom names. The training data has to be a list of training entries, each containing a sentence with the location of the name(s) identified. Please review the training data example again to see how you need to supply a full sentence and not just a name.
spaCy is not meant to be a gazetteer-matching tool. You are likely better off generating 100 sentences that use some of these names and then training spaCy on those annotated sentences. You can add more full-sentence examples as needed to increase accuracy. spaCy's native NER for names is robust and does not need 12000 examples.
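One way to generate such sentences without hand-counting character offsets is to fill names into sentence templates and compute the offsets programmatically. This is only a sketch: the templates, the helper names make_example and build_train_data, and the two sample names are assumptions for illustration, and templated sentences will be less varied than real text.

```python
import random

# Hypothetical sentence templates; "{name}" marks where a name is inserted.
TEMPLATES = [
    "{name} lives in Chennai",
    "Did you talk to {name} yesterday",
    "{name} bought a new phone",
]


def make_example(template, name, label="PERSON"):
    """Fill one template and compute the entity's character offsets."""
    start = template.index("{name}")
    text = template.replace("{name}", name, 1)
    end = start + len(name)  # end offset is exclusive
    return (text, {"entities": [(start, end, label)]})


def build_train_data(names, n_examples=100, seed=0):
    """Pair random names with random templates (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [
        make_example(rng.choice(TEMPLATES), rng.choice(names))
        for _ in range(n_examples)
    ]


# In practice the names would be read from the CSV file mentioned in the question.
names = ["Shivani", "Isha"]
TRAIN_DATA = build_train_data(names, n_examples=5)
```

Because the offsets are derived from the template rather than typed by hand, text[start:end] always equals the inserted name.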
@ak_35's answer below provides examples of how to provide training sentences with the location of names labeled.
Your current format for TRAIN_DATA will not give you good results. spaCy needs each name to appear inside a full sentence, in the format shown below:
# Entity offsets are (start_char, end_char, label); end_char is exclusive.
TRAIN_DATA = [
    ('Shivani lives in chennai', {
        'entities': [(0, 7, 'PERSON')]
    }),
    ('Did you talk to Shivani yesterday', {
        'entities': [(16, 23, 'PERSON')]
    }),
    ('Isha bought a new phone', {
        'entities': [(0, 4, 'PERSON')]
    })
]
See the documentation here. As for automating the annotation of 12000 entries, there are tools that can help you annotate your data quickly. You can use Prodigy (from the same developers as spaCy), but it is a paid product. You can see it in action here. If you give up on NER, pattern matching might also work well for you if you just need to find names in a document; done right, it would be faster and more accurate too.
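As a rough illustration of the pattern-matching route, the sketch below uses spaCy's PhraseMatcher to look up a fixed list of names. It assumes spaCy v3; the function find_names and the two sample names are illustrative, and in practice the list would come from the names dataset.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; no trained model is required

# Sample names for illustration only.
names = ["Shivani", "Isha"]

# attr="LOWER" makes the lookup case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("PERSON", [nlp.make_doc(name) for name in names])


def find_names(text):
    """Return (matched_text, start_char, end_char) tuples in document order."""
    doc = nlp.make_doc(text)
    found = []
    for _match_id, start, end in matcher(doc):
        span = doc[start:end]
        found.append((span.text, span.start_char, span.end_char))
    return sorted(found, key=lambda item: item[1])


matches = find_names("Did you talk to Shivani and Isha yesterday")
```

Unlike a trained NER model, this only finds names that are in the list, so it cannot generalize to unseen names, but it needs no training data at all.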