
Best method to confirm an entity

I would like to understand the best approach to the following problem.

I have documents very similar to resumes/CVs, and I have to extract entities from them (Name, Surname, Birthday, Cities, zip code, etc.).

To extract those entities I am combining different finders (regex, dictionary, etc.).

There are no problems with those finders, but I am looking for a method or algorithm (or something like that) to confirm the entities.

With "confirm" I mean that I have to find specific term (or entities) in proximities (closer to the entities I have found).

Example:

My name is <name>
Name: <name>
Name and Surname: <name>

I can confirm the entity <name> because it is close to a specific term that lets me understand the "context". If I have the word "name" or "surname" near the entity, then I can say that I have found the <name> with good probability.

So the goal is to write those kinds of rules to confirm entities. Another example would be:

My address is ......, 00143 Rome

Italian zip codes are 5 digits long (numeric only), so it is easy to find a 5-digit number inside my document (I use a regex, as I wrote above), and I also check it by querying a database to see whether the number exists. The problem here is that I need one more check to confirm it definitively.

I must check whether that number is near the entity <city>; if it is, OK... I have good probability.
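To make the idea concrete, here is a rough sketch of the kind of rule I mean (the window size and trigger words are only placeholders, not tuned values):

import re

# Rough sketch: confirm a candidate entity if a trigger word appears
# within a fixed character window around the match.
TRIGGERS = {
    "name": ["name", "surname"],
    "zipcode": ["city", "address"],  # or the <city> entity found earlier
}

def confirm(text, match, kind, window=40):
    """Return True if a trigger word for `kind` occurs near `match`."""
    start = max(0, match.start() - window)
    end = min(len(text), match.end() + window)
    context = text[start:end].lower()
    return any(t in context for t in TRIGGERS[kind])

text = "My address is Via Roma 1, 00143 Rome"
for m in re.finditer(r"\b\d{5}\b", text):          # Italian zip codes: 5 digits
    print(m.group(), confirm(text, m, "zipcode"))  # 00143 True ("address" is nearby)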

I also tried to train a model, but I do not really have "context" (sentences). Training the model with:

My name is: <name>John</name>
Name: <name>John</name>
Name/Surname: <name>John</name>
<name>John</name> is my name

does not sound good to me because:

  1. I have read that we need many sentences to train a good model.
  2. Those are not "sentences"; I do not have "context" (remember, as I said, the documents are similar to resumes/CVs).
  3. Maybe those phrases are too short.

I do not know how many different ways there are to say the same thing, but surely I cannot find 15,000 ways :)

What method should I use to try to confirm my entities?

Thank you so much!

asked Sep 04 '15 by Dail


1 Answer

Problem statement

First of all, I don't think that your decomposition of the task into two steps (extract and confirm) is the best approach, unless I am missing some specifics of the problem. If I understand correctly, your goal is to extract structured info like Name/City/etc. from a set of docs with maximum precision and recall; either metric can be more important, but usually they are weighted equally, e.g. by using the F1-measure.

Evaluate first

'You can't control what you can't measure.' (Tom DeMarco)

I'd propose to first prepare an evaluation system and a marked-up dataset: for each document, find the correct Name/City/etc. This can be done fully manually (which is more 'true', but harder) or semi-automatically, e.g. by applying some method, including the one under development, and correcting its errors, if any. The evaluation system should be able to compute precision and recall (see the confusion matrix in order to implement them easily yourself).
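As a sketch, precision, recall and F1 can be computed directly from the sets of gold and extracted items (the sample data below is made up):

# Minimal evaluation sketch: compare extracted items against the manually
# marked-up gold set, document by document, and aggregate the counts.
def evaluate(gold, predicted):
    """gold, predicted: per-document sets of extracted items."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # correctly extracted
        fp += len(p - g)   # extracted, but wrong
        fn += len(g - p)   # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{"John"}, {"Rome", "00143"}]
pred = [{"John"}, {"Rome", "1985"}]
print(evaluate(gold, pred))  # (0.666..., 0.666..., 0.666...)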

As for its size, I wouldn't be too afraid of having to prepare a very big dataset: sure, more is better, but that is crucial only for complex (significantly non-linear) tasks with a lot of features. I believe 100-200 docs are enough to start with in your case, and they would take only several hours to prepare.

Then you can evaluate your simple extractors based on regexes and dictionaries; it is best if different aspects (Name or City) have separate metrics. Depending on the results, your actions may differ.

Low precision - add more specific features

If the method shows too low precision, i.e. extracts too many wrong items, you should add specificity, or specific features; I'd search for them in scientific papers devoted to information extraction, particularly those aimed at the specific information type, be it Name/Surname, or Address, or something more vague like skills, if you're interested in such info. For instance, many papers devoted to resume parsing (like [1] and [2]) note that Name/Surname are usually placed at the very beginning of the text, or that cities are usually preceded by 'at'. I don't know the specifics of your documents, but I doubt they violate such patterns.
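For illustration, such positional and context features could look like this (a sketch; the concrete features and thresholds should come from the papers and from your own error analysis):

# Sketch of hand-crafted features for one candidate occurrence.
def features(text, start, end):
    token = text[start:end]
    before = text[max(0, start - 20):start].lower()
    return {
        "near_top": start < 100,                           # names often at the very beginning
        "preceded_by_at": before.rstrip().endswith("at"),  # cities often after 'at'
        "is_capitalized": token[:1].isupper(),
        "is_5_digits": token.isdigit() and len(token) == 5,
    }

text = "John Smith\nCurrently living at Rome, 00143."
print(features(text, 31, 35))  # features for 'Rome'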

Also, it may be useful and easy to treat the output of a Named Entity Recognizer, e.g. Stanford NLP, as a feature (see also this relevant question).

Again, a harder but better way is to analyze the approaches used by NERC (Named Entity Recognition and Classification) systems and to adapt them to the specifics of your task and docs.

These features can be aggregated by any supervised machine learning method (start with Logistic Regression and Random Forest if you don't have much experience): you know the positive and negative (everything that is not positive) answers from your evaluation dataset, so just transform them into feature space and feed them to some ML library like Weka.
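A minimal sketch with scikit-learn in place of Weka (any similar library will do; the feature dicts have the shape of the previous sketch, and the toy data is made up):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: one feature dict per candidate; labels come from
# the marked-up dataset (1 = real entity, 0 = false candidate).
X_dicts = [
    {"near_top": True, "preceded_by_at": False, "is_capitalized": True},
    {"near_top": False, "preceded_by_at": True, "is_capitalized": True},
    {"near_top": False, "preceded_by_at": False, "is_capitalized": False},
]
y = [1, 1, 0]

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X_dicts), y)

candidate = {"near_top": True, "preceded_by_at": False, "is_capitalized": True}
print(clf.predict_proba(vec.transform([candidate]))[0, 1])  # P(real entity)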

Low recall - extract more candidates

If the method shows too low recall, i.e. misses a lot of items, then you should extend the set of candidates: for example, develop less restrictive patterns, or add fuzzy matching (look at the Jaro-Winkler or Soundex string metrics) to the dictionary lookup.
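For instance, a fuzzy dictionary lookup could look like this sketch, using the third-party jellyfish library (function names vary slightly between versions; the threshold is arbitrary):

import jellyfish  # third-party: pip install jellyfish

CITIES = ["Rome", "Milan", "Naples"]

def fuzzy_city_match(token, threshold=0.9):
    """Accept a token if it is close enough to a known city."""
    for city in CITIES:
        if jellyfish.jaro_winkler_similarity(token, city) >= threshold:
            return city
        if jellyfish.soundex(token) == jellyfish.soundex(city):
            return city
    return None

print(fuzzy_city_match("Roma"))    # 'Rome' (same Soundex code, R500)
print(fuzzy_city_match("Berlin"))  # None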

Another option is to apply part-of-speech tagging and take each noun as a candidate (maybe each proper noun for some info items), or take noun bigrams, or add other weak restrictions. In this case, most probably, your precision will degrade, so the paragraph above would have to be considered.
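A sketch with NLTK (the tokenizer and tagger models need to be downloaded once):

# Requires: pip install nltk, plus one-time downloads of the 'punkt'
# tokenizer and 'averaged_perceptron_tagger' models via nltk.download().
import nltk

text = "John Smith lives at Rome and works as a developer."
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# Every proper noun (NNP) becomes a candidate: recall goes up,
# precision most likely goes down.
candidates = [word for word, tag in tagged if tag == "NNP"]
print(candidates)  # e.g. ['John', 'Smith', 'Rome']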

NB: If your data comes from the Web (e.g. profiles from LinkedIn), try searching for the keywords 'Web data extraction' or take a look at import.io.

Literature

Just a few random ones; try searching on Google Scholar, preferably starting with surveys:

  1. Renuka S. Anami, Gauri R. Rao. Automated Profile Extraction and Classification with Stanford Algorithm. International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume 4, Issue 7, December 2014. (link)

  2. Swapnil Sonar. Resume Parsing with Named Entity Clustering Algorithm. 2015. (link)

answered Nov 04 '22 by Nikita Astrakhantsev