
Extract person names from unstructured text

I have a collection of bills and invoices, so the text has no context (I mean it doesn't tell a story). I want to extract people's names from those bills. I tried OpenNLP, but the quality of the trained model is poor because I don't have context. So the first question is: can I train a model on people's names alone, without context? If that is possible, can you point me to a good article on how to build such a model? Most of the articles I have read don't explain the steps needed to build a new model.

I have a database with more than 100,000 person names (first name, last name). If NER systems don't work in my case (because there is no context), what is the best way to search for those candidates? (I mean, should I search for every first name combined with every last name?)

Thanks.

anas.khayata asked Jan 07 '15 08:01


2 Answers

Regarding "context", I guess you mean that you don't have entire sentences, i.e. no previous / next tokens, in which case you face quite a non-standard NER problem. I am not aware of any available software or training data for this particular problem; if you find none, you'll have to build your own corpus for training and/or evaluation.

Your database of names will probably help greatly, depending on what proportion of the names on the bills are actually present in the database. You'll probably also have to rely on the character-level morphology of names, expressed as patterns (see for instance the patterns in [1]). Once you have a training set with features (presence in the database, morphology, other information from the bill) and solutions (the actual names in annotated bills), applying a standard machine-learning method such as an SVM will be quite straightforward (if you are not familiar with this, just ask).
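To make the feature idea concrete, here is a minimal sketch of per-token features combining database lookup with a character-level shape pattern. The tiny `FIRST_NAMES` and `LAST_NAMES` sets and the function names are hypothetical stand-ins for illustration, not part of any library, and the chosen features are only one plausible selection.

```python
# Hypothetical stand-ins for the real database of 100,000+ names.
FIRST_NAMES = {"anna", "john", "maria"}
LAST_NAMES = {"smith", "keller", "lopez"}


def word_shape(token):
    """Collapse a token into a coarse character pattern.

    Uppercase letters become 'X', lowercase letters 'x', digits 'd';
    other characters are kept. Runs of the same symbol are squeezed.
    """
    out = []
    for ch in token:
        if ch.isupper():
            sym = "X"
        elif ch.islower():
            sym = "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch
        # Append only when the symbol differs from the previous one.
        if not out or out[-1] != sym:
            out.append(sym)
    return "".join(out)


def token_features(token):
    """Build a feature dictionary for one token of a bill."""
    lower = token.lower()
    return {
        "in_first_names": lower in FIRST_NAMES,
        "in_last_names": lower in LAST_NAMES,
        "shape": word_shape(token),
        "has_digit": any(ch.isdigit() for ch in token),
        "length": len(token),
    }


print(token_features("Smith"))
print(token_features("INV-2015"))
```

Feature dictionaries like these, paired with the annotated labels, could then be vectorized and passed to whichever classifier you choose (an SVM, for instance).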

Some other suggestions:

  • You can most probably also use other information from the bill: company name, positions, tax mentions, etc.
  • You can also proceed selectively: if every bill should mention (exactly?) one person name, you can exclude all other text (e.g. amounts, tax names, positions, etc.), or assume in a dedicated model that, among all the text in a bill, only one item should be guessed as a name.
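The "one name per bill" assumption can be sketched as picking the single best-scoring candidate among adjacent token pairs. The scoring weights, the small name sets, and the function names below are all hypothetical and hand-picked for illustration; a trained model would normally supply the score instead.

```python
# Hypothetical stand-ins for the real name database.
FIRST_NAMES = {"anna", "john"}
LAST_NAMES = {"smith", "keller"}


def score_pair(first, last):
    """Hand-made score: database hits count most, capitalisation a little."""
    score = 0
    if first.lower() in FIRST_NAMES:
        score += 2
    if last.lower() in LAST_NAMES:
        score += 2
    if first.istitle() and last.istitle():
        score += 1
    return score


def best_name(tokens):
    """Return the single highest-scoring adjacent pair, or None if nothing scores."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    best = max(pairs, key=lambda pair: score_pair(*pair))
    return best if score_pair(*best) > 0 else None


tokens = ["Invoice", "4711", "Anna", "Keller", "Total", "120.00", "EUR"]
print(best_name(tokens))
```

Because only one candidate is kept per bill, weak matches elsewhere in the text (such as two capitalised words that are not names) are discarded whenever a stronger candidate exists.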

[1] Ranking algorithms for named-entity extraction: Boosting and the voted perceptron (Michael Collins, 2002)

eldams answered Sep 28 '22 00:09


I'd start with some regular expressions, then possibly augment that with a dictionary-based approach (i.e., big list of names).

No matter what you do, it won't be perfect, so be sure to keep that in mind.
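As a rough sketch of that two-step idea, a regular expression can propose candidates (here, two adjacent capitalised words) and the name list can then filter them. The pattern, the small name sets, and `find_names` are hypothetical choices for illustration. Looking each word up in a set also avoids having to search for every first name combined with every last name.

```python
import re

# Hypothetical stand-ins for the real list of 100,000+ names.
FIRST_NAMES = {"anna", "john"}
LAST_NAMES = {"smith", "keller"}

# A capitalised word followed by another capitalised word. The second word
# sits in a lookahead so that overlapping pairs ("To John", "John Smith")
# are all examined. ASCII-only: accented or hyphenated names are missed.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+)(?=\s+([A-Z][a-z]+)\b)")


def find_names(text):
    """Return 'First Last' strings whose two parts both appear in the name lists."""
    hits = []
    for match in CANDIDATE.finditer(text):
        first, last = match.group(1), match.group(2)
        if first.lower() in FIRST_NAMES and last.lower() in LAST_NAMES:
            hits.append(f"{first} {last}")
    return hits


print(find_names("Invoice To John Smith Amount 120.00"))
```

Names that are written in all capitals, in "Last, First" order, or with initials would need additional patterns.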

Charlie Greenbacker answered Sep 28 '22 00:09