Train Spacy NER on Indian Names

I am trying to customize spaCy's NER to identify Indian names. I am following this guide: https://spacy.io/usage/training, and this is the dataset I am using: https://gist.githubusercontent.com/mbejda/9b93c7545c9dd93060bd/raw/b582593330765df3ccaae6f641f8cddc16f1e879/Indian-Female-Names.csv

According to the example code, I am supposed to provide training data in the following format:

TRAIN_DATA = [
    ('Shivani', {
        # (start, end, label): character offsets, end is exclusive
        'entities': [(0, 7, 'PERSON')]
    }),
    ('Isha', {
        'entities': [(0, 4, 'PERSON')]
    })
]

How do I provide training data to spaCy for ~12000 names? Manually specifying each entity would be a chore. Is there a tool that can tag all the names automatically?

asked Mar 26 '18 04:03

shri_wahal

2 Answers

You are missing the point of training an NLP library on custom names. The training data has to be a list of entries, each consisting of a sentence together with the character positions of the name(s) in it. Please review the training data example again: you need to supply a full sentence, not just a name, because the model learns to recognize names from the context they appear in.

spaCy is not meant to be a gazetteer-matching tool. You are likely better off generating around 100 sentences that use some of these names and then training spaCy on those annotated sentences. You can add more full-sentence examples as needed to increase accuracy. spaCy's built-in NER for person names is already robust and does not need 12000 examples.

@ak_35's answer below provides examples of how to provide training sentences with the location of names labeled.
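Generating the annotated sentences does not have to be manual. The following is a minimal sketch, assuming a handful of hand-written sentence templates: the `TEMPLATES` list and the helper names `make_example` and `build_train_data` are hypothetical and not part of spaCy. It fills each template with a name and computes the character offsets itself, so no entity has to be typed by hand.

```python
# Hypothetical sentence templates; "{name}" marks where a name is inserted.
TEMPLATES = [
    "{name} lives in Chennai",
    "Did you talk to {name} yesterday",
    "I met {name} at the office",
]


def make_example(template, name, label="PERSON"):
    """Fill one template and return (text, {"entities": [(start, end, label)]})."""
    start = template.index("{name}")  # the name begins where the placeholder was
    end = start + len(name)           # end offset is exclusive
    text = template.replace("{name}", name, 1)
    return text, {"entities": [(start, end, label)]}


def build_train_data(names, templates=TEMPLATES):
    """Pair each non-empty name with a template, cycling through the templates."""
    cleaned = [n.strip() for n in names if n.strip()]
    return [
        make_example(templates[i % len(templates)], name)
        for i, name in enumerate(cleaned)
    ]


if __name__ == "__main__":
    for text, annotations in build_train_data(["Shivani", "Isha "]):
        print(text, annotations)
```

A few templates repeated over many names will look very uniform, so it is probably worth writing varied templates and mixing in some real sentences as well.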

answered Oct 03 '22 17:10

Adnan S

Your current TRAIN_DATA format will not give you good results, because it contains bare names with no surrounding context. spaCy needs full sentences with the entity offsets marked, as shown below:

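TRAIN_DATA = [
    # Each entity is (start, end, label); offsets are character positions
    # and the end offset is exclusive, so text[start:end] is the name.
    ('Shivani lives in chennai', {
        'entities': [(0, 7, 'PERSON')]
    }),
    ('Did you talk to Shivani yesterday', {
        'entities': [(16, 23, 'PERSON')]
    }),
    ('Isha bought a new phone', {
        'entities': [(0, 4, 'PERSON')]
    })
]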
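spaCy's entity offsets are character positions with an exclusive end, so `text[start:end]` should reproduce the entity exactly. Off-by-one offsets are easy to introduce when annotating by hand, so a quick sanity check before training can help. This is only a heuristic sketch (the function name `misaligned_spans` is made up for illustration): it flags spans that fall outside the text or that cut through the middle of a word.

```python
def misaligned_spans(train_data):
    """Return (text, span, sliced_text) for every span that looks misaligned."""
    problems = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            inside = 0 <= start < end <= len(text)
            # A span cuts a word if the character just outside it is alphanumeric.
            cuts_left = inside and start > 0 and text[start - 1].isalnum()
            cuts_right = inside and end < len(text) and text[end].isalnum()
            if not inside or cuts_left or cuts_right:
                problems.append((text, (start, end, label), text[start:end]))
    return problems


if __name__ == "__main__":
    sample = [
        ("Shivani lives in chennai", {"entities": [(0, 6, "PERSON")]}),  # one short
        ("Isha bought a new phone", {"entities": [(0, 4, "PERSON")]}),
    ]
    for problem in misaligned_spans(sample):
        print(problem)
```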

See the spaCy training documentation for details.

As for automating the annotation of 12000 entries, there are tools that can help you annotate your data quickly. You can use Prodigy (from the same developers as spaCy), but it is a paid product; its online demo shows it in action.

If you give up on NER, pattern matching might also work well for you. If you just need to find known names in a document, it can be faster and, done right, more accurate too.
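To illustrate the pattern-matching alternative, here is a small sketch that uses only Python's standard `re` module as a stand-in. It is not spaCy's own matcher (spaCy provides `Matcher` and `PhraseMatcher` for this purpose), and the helper names `build_name_pattern` and `find_names` are hypothetical. It compiles the name list into one case-insensitive, whole-word pattern and returns the character offsets of each hit.

```python
import re


def build_name_pattern(names):
    """Compile a whole-word, case-insensitive pattern from a list of names."""
    cleaned = {n.strip() for n in names if n.strip()}
    if not cleaned:
        raise ValueError("at least one non-empty name is required")
    # Longest names first so a longer name is tried before a shorter prefix.
    ordered = sorted(cleaned, key=len, reverse=True)
    alternation = "|".join(re.escape(n) for n in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def find_names(text, pattern):
    """Return (start, end, matched_text) for every name found in text."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


if __name__ == "__main__":
    pattern = build_name_pattern(["Shivani", "Isha "])
    print(find_names("Did Isha talk to shivani?", pattern))
```

Unlike a trained NER model, this only finds names that are already in the list, and it cannot tell a name from an ordinary word with the same spelling.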

answered Oct 03 '22 18:10

ak_35

Tags:

python

python-3.x

nlp

named-entity-recognition

spacy
