
Which NER model can find person names inside a resume/CV?

I have just started with Stanford CoreNLP, and I would like to build a custom NER model to find persons.

Unfortunately, I did not find a good NER model for Italian. I need to find these entities inside a resume/CV document.

The problem here is that documents like these can have different structures; for example, I can have:

CASE 1

- Name: John

- Surname: Travolta

- Last name: Travolta

- Full name: John Travolta

(many different labels can represent the person entity I need to extract)

CASE 2

My name is John Travolta and I was born ...

Basically, I can have structured data (with different labels) or free-running text where I should find these entities.

What is the best approach for this kind of document? Can a MaxEnt model work in this case?


EDIT @vihari-piratla

At the moment, I adopt the strategy of finding a pattern that has something on the left and something on the right; with this method I find the entity about 80-85% of the time.

Example:

Name: John
Birthdate: 2000-01-01

It means that I have "Name:" on the left of the pattern and a \n on the right (the value runs until the \n is found). I can create a very long list of patterns like these. I thought about patterns because I do not need names inside "other" contexts.

For example, if the user writes other names inside a job experience, I do not need them, because I am looking for the personal name, not others. With this method I can reduce false positives, because I will look at specific patterns, not "general names".

A problem with this method is that I end up with a big list of patterns (1 pattern = 1 regex), so it does not scale well as I add more.

If I could train a NER model with all those patterns it would be awesome, but I would need tons of documents to train it well.
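One way to keep the pattern list from growing linearly is to fold all the "Label: value" patterns into a single regex built from a label list (a sketch; the LABELS below are illustrative and would need the real Italian variants added):

```python
import re

# Hypothetical label list; extend with the Italian variants used in
# real CVs ("Nome", "Cognome", ...). One generic regex replaces many
# hand-written per-label patterns.
LABELS = ["Name", "Surname", "Last name", "Full name", "Nome", "Cognome"]

FIELD_RE = re.compile(
    r"^(?P<label>" + "|".join(re.escape(l) for l in LABELS) + r")\s*:\s*(?P<value>.+)$",
    re.MULTILINE,
)

def extract_fields(text):
    """Return {label: value} for every 'Label: value' line in the text."""
    return {m.group("label"): m.group("value").strip()
            for m in FIELD_RE.finditer(text)}
```

Adding a new label is then a one-line change to the list instead of a new regex.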

Dail asked Dec 28 '15



1 Answer

The first case could be trivial, and I agree with Ozborn's suggestion.

I would like to make a few suggestions for case-2.
Stanford NLP provides an excellent English name recognizer, but may not be able to find all the person names. OpenNLP also gives a decent performance, but much lesser than Stanford. There are many other entity recognizers available for English. I will focus here on StanfordNLP, here are a few things to consider.

  1. Gazettes. You can provide the model with a list of names and also customize how the gazette entries are matched. Stanford also provides a sloppy-match option that, when set, allows partial matches with the gazette entries. Partial matches should work well with person names.

  2. Stanford recognizes entities constructively. If a name like "John Travolta" is recognized in a document, it will also tag "Travolta" in the same document, even if it had no prior knowledge of "Travolta". So, append as much information to the document as possible: if "John Travolta" is recognized by the rules employed in case 1, add it in a familiar context such as "My name is John Travolta." Adding such dummy sentences can improve recall.
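For point 1, a gazette is wired in through the CRFClassifier training properties. The property names below (useGazettes, gazette, sloppyGazette) come from Stanford NER's feature flags; the file names are placeholders:

```properties
# Fragment of a Stanford CRFClassifier training properties file (sketch).
useGazettes = true
gazette = persons.gazette
sloppyGazette = true
# persons.gazette holds one "CLASS entry" per line, e.g.:
#   PERSON John Travolta
#   PERSON Uma Thurman
```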

Building an annotated benchmark for training is a very costly and boring process; you would need to annotate on the order of tens of thousands of sentences for decent test performance. I am sure that even with a model trained on annotated data, the performance would not be better than with the two steps above implemented.
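The dummy-sentence trick from point 2 is easy to script; a minimal sketch (augment_document is a hypothetical helper, not part of any Stanford API):

```python
def augment_document(text, names_from_rules):
    """Prepend dummy sentences so the NER model sees each
    rule-extracted name (from case 1) in a familiar context."""
    dummy = " ".join(f"My name is {name}." for name in names_from_rules)
    return dummy + "\n" + text
```

The augmented text is what you would then feed to the recognizer; the dummy prefix can be stripped from the output offsets afterwards.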

@edit

Since the asker of this question is interested in unsupervised, pattern-based approaches, I am expanding my answer to discuss them.

When supervised data is not available, a method called bootstrapped pattern learning is generally used. The algorithm starts with a small set of seed instances of interest (like a list of books) and outputs more instances of the same type.
Refer to the following resources for more information:

  • SPIED is a tool that uses the above-described technique and is available for download and use.
  • Sonal Gupta received her Ph.D. on this topic; her dissertation is available here.
  • For a light introduction on this topic, see these slides.
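A toy illustration of the bootstrapping loop (not SPIED itself, just the idea): start from seed names, learn left-context patterns around them, then harvest new names occurring in the same contexts:

```python
import re

def bootstrap(corpus, seeds, rounds=2):
    """Bootstrapped pattern learning, drastically simplified:
    alternate between inducing contexts from known names and
    applying those contexts to find new names."""
    names = set(seeds)
    for _ in range(rounds):
        # 1) Learn contexts: the token immediately left of each known name.
        contexts = set()
        for name in names:
            for m in re.finditer(re.escape(name), corpus):
                left = corpus[:m.start()].split()
                if left:
                    contexts.add(left[-1])
        # 2) Apply contexts: any capitalized bigram after a learned context
        #    is promoted to a candidate name.
        for ctx in contexts:
            pat = re.escape(ctx) + r"\s+([A-Z][a-z]+ [A-Z][a-z]+)"
            names.update(re.findall(pat, corpus))
    return names
```

A real system like SPIED additionally scores patterns and candidates to keep the seed set from drifting; this sketch accepts everything, which is why bootstrapping degrades without such filtering.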

Thanks

Vihari Piratla answered Sep 22 '22