
Which NER model can find person names inside a resume/CV?

I have just started with Stanford CoreNLP, and I would like to build a custom NER model to find persons.

Unfortunately, I did not find a good NER model for Italian. I need to find these entities inside a resume/CV document.

The problem here is that documents like these can have different structures; for example, I can have:

CASE 1

- Name: John

- Surname: Travolta

- Last name: Travolta

- Full name: John Travolta

(many different labels can represent the person entity I need to extract)

CASE 2

My name is John Travolta and I was born ...

Basically, I can have structured data (with different labels) or free-running text where I should find these entities.

What is the best approach for this kind of document? Can a MaxEnt model work in this case?


EDIT @vihari-piratla

At the moment, I adopt the strategy of finding a pattern that has something on the left and something on the right; with this method I find the entity about 80-85% of the time.

Example:

Name: John
Birthdate: 2000-01-01

It means that I have "Name:" on the left of the pattern and a \n on the right (the value runs until the \n is found). I can create a very long list of patterns like these. I thought about patterns because I do not need names inside "other" contexts.

For example, if the user writes other names inside a job experience, I do not need them, because I am looking for the personal name, not others. With this method I can reduce false positives, because I will look at specific patterns, not "general names".

A problem with this method is that I end up with a big list of patterns (1 pattern = 1 regex), so it does not scale well as I add more.

If I could train a NER model with all those patterns it would be awesome, but I would need tons of documents to train it well.
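One way to keep the pattern list from growing linearly is to fold all the "Label: value" patterns into a single regex built from a label list (a sketch; the LABELS below are illustrative and would need the real Italian variants added):

```python
import re

# Hypothetical label list; extend with the Italian variants used in
# real CVs ("Nome", "Cognome", ...). One generic regex replaces many
# hand-written per-label patterns.
LABELS = ["Name", "Surname", "Last name", "Full name", "Nome", "Cognome"]

FIELD_RE = re.compile(
    r"^(?P<label>" + "|".join(re.escape(l) for l in LABELS) + r")\s*:\s*(?P<value>.+)$",
    re.MULTILINE,
)

def extract_fields(text):
    """Return {label: value} for every 'Label: value' line in the text."""
    return {m.group("label"): m.group("value").strip()
            for m in FIELD_RE.finditer(text)}
```

Adding a new label is then a one-line change to the list instead of a new regex.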

Dail asked Dec 28 '15



1 Answer

The first case could be trivial, and I agree with Ozborn's suggestion.

I would like to make a few suggestions for case-2.
Stanford NLP provides an excellent English name recognizer, but may not be able to find all the person names. OpenNLP also gives a decent performance, but much lesser than Stanford. There are many other entity recognizers available for English. I will focus here on StanfordNLP, here are a few things to consider.

  1. Gazettes. You can provide the model with a list of names and also customize how the gazette entries are matched. Stanford also provides a sloppy-match option that, when set, allows partial matches with the gazette entries. Partial matches should work well with person names.

  2. Stanford recognizes entities constructively. If a name like "John Travolta" is recognized in a document, it will also tag "Travolta" in the same document, even if it had no prior knowledge of "Travolta". So, append as much information to the document as possible: if "John Travolta" is recognized by the rules employed in case 1, add it in a familiar context such as "My name is John Travolta." Adding such dummy sentences can improve recall.
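For point 1, a gazette is wired in through the CRFClassifier training properties. The property names below (useGazettes, gazette, sloppyGazette) come from Stanford NER's feature flags; the file names are placeholders:

```properties
# Fragment of a Stanford CRFClassifier training properties file (sketch).
useGazettes = true
gazette = persons.gazette
sloppyGazette = true
# persons.gazette holds one "CLASS entry" per line, e.g.:
#   PERSON John Travolta
#   PERSON Uma Thurman
```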

Building an annotated benchmark for training is a very costly and boring process; you would need to annotate on the order of tens of thousands of sentences for decent test performance. I am sure that even with a model trained on annotated data, the performance would not be better than with the two steps above implemented.
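The dummy-sentence trick from point 2 is easy to script; a minimal sketch (augment_document is a hypothetical helper, not part of any Stanford API):

```python
def augment_document(text, names_from_rules):
    """Prepend dummy sentences so the NER model sees each
    rule-extracted name (from case 1) in a familiar context."""
    dummy = " ".join(f"My name is {name}." for name in names_from_rules)
    return dummy + "\n" + text
```

The augmented text is what you would then feed to the recognizer; the dummy prefix can be stripped from the output offsets afterwards.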

@edit

Since the asker of this question is interested in unsupervised, pattern-based approaches, I am expanding my answer to discuss them.

When supervised data is not available, a method called bootstrapped pattern learning is generally used. The algorithm starts with a small set of seed instances of interest (like a list of books) and outputs more instances of the same type.
Refer to the following resources for more information:

  • SPIED is a tool that uses the above-described technique and is available for download and use.
  • Sonal Gupta received her Ph.D. on this topic; her dissertation is available here.
  • For a light introduction on this topic, see these slides.
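A toy illustration of the bootstrapping loop (not SPIED itself, just the idea): start from seed names, learn left-context patterns around them, then harvest new names occurring in the same contexts:

```python
import re

def bootstrap(corpus, seeds, rounds=2):
    """Bootstrapped pattern learning, drastically simplified:
    alternate between inducing contexts from known names and
    applying those contexts to find new names."""
    names = set(seeds)
    for _ in range(rounds):
        # 1) Learn contexts: the token immediately left of each known name.
        contexts = set()
        for name in names:
            for m in re.finditer(re.escape(name), corpus):
                left = corpus[:m.start()].split()
                if left:
                    contexts.add(left[-1])
        # 2) Apply contexts: any capitalized bigram after a learned context
        #    is promoted to a candidate name.
        for ctx in contexts:
            pat = re.escape(ctx) + r"\s+([A-Z][a-z]+ [A-Z][a-z]+)"
            names.update(re.findall(pat, corpus))
    return names
```

A real system like SPIED additionally scores patterns and candidates to keep the seed set from drifting; this sketch accepts everything, which is why bootstrapping degrades without such filtering.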

Thanks

Vihari Piratla answered Sep 22 '22