
Named Entity Resolution Algorithm

I am trying to build an entity resolution system, where my entities are:

(i) General named entities: organization, person, location, date, time, money, and percent.
(ii) Some other entities, such as product and titles of persons (president, CEO, etc.).
(iii) Coreferring expressions, such as pronouns, determiner phrases, synonyms, string matches, demonstrative noun phrases, aliases, and appositions.

Based on the literature and other references, I have limited the scope: I do not consider the ambiguity of an entity beyond its entity category. That is, I treat the Oxford of Oxford University as different from Oxford the place, since the former is the first word of an organization entity and the latter is a location entity.

My task is to construct a resolution algorithm that extracts and resolves the entities.

So, I am first working out an entity extractor. Second, for relating the coreferences, the literature I found (such as this seminal work) uses a decision-tree-based algorithm with features like distance, i-pronoun, j-pronoun, string match, definite noun phrase, demonstrative noun phrase, number agreement, semantic class agreement, gender agreement, both proper names, alias, apposition, etc.
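For reference, here is a minimal sketch of how such pairwise features might be computed for a candidate antecedent i and a later mention j. The `Mention` fields, the word lists, and the function names are my own assumptions for illustration and are not taken from the cited work; alias and apposition are left out because they need more context than a pair of strings. Python 3 syntax is assumed.

```python
from dataclasses import dataclass

# Small illustrative word lists (assumption: a real system would use fuller ones).
PRONOUNS = {"he", "she", "it", "they", "him", "her", "his", "its", "their", "them"}
DEMONSTRATIVES = {"this", "that", "these", "those"}


@dataclass
class Mention:
    text: str        # surface string of the mention (assumed non-empty)
    sentence: int    # index of the sentence containing it
    number: str      # "sg", "pl" or "unknown"
    gender: str      # "m", "f", "n" or "unknown"
    sem_class: str   # e.g. "PERSON", "ORGANIZATION", "LOCATION" or "unknown"
    is_proper: bool  # True for proper names


def _agree(a, b):
    """Return True/False when both values are known, None otherwise."""
    if "unknown" in (a, b):
        return None
    return a == b


def pair_features(i, j):
    """Features for a candidate antecedent i and a later mention j."""
    j_tokens = j.text.lower().split()
    return {
        "distance": j.sentence - i.sentence,
        "i_pronoun": i.text.lower() in PRONOUNS,
        "j_pronoun": j.text.lower() in PRONOUNS,
        "string_match": i.text.lower() == j.text.lower(),
        "definite_np": j_tokens[0] == "the",
        "demonstrative_np": j_tokens[0] in DEMONSTRATIVES,
        "number_agreement": _agree(i.number, j.number),
        "semclass_agreement": _agree(i.sem_class, j.sem_class),
        "gender_agreement": _agree(i.gender, j.gender),
        "both_proper": i.is_proper and j.is_proper,
    }


obama = Mention("Obama", 0, "sg", "m", "PERSON", True)
he = Mention("he", 1, "sg", "m", "PERSON", False)
features = pair_features(obama, he)
```

Each such dictionary would be one training row for the decision tree, labelled with whether the pair actually corefers.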

The algorithm seems like a nice one, where entities are extracted with a Hidden Markov Model (HMM).

I could work out an entity recognition system with an HMM. Now I am trying to work out a coreference as well as an entity resolution system. Instead of using so many features, I was wondering whether I could take an annotated corpus and train an HMM-based tagger on it directly, with a view to solving relationship extraction, like this:

"Obama/PERS is/NA delivering/NA a/NA lecture/NA in/NA Washington/LOC, he/PPERS knew/NA it/NA was/NA going/NA to/NA be/NA small/NA as/NA it/NA may/NA not/NA be/NA his/PoPERS speech/NA as/NA Mr. President/APPERS"

where, PERS -> PERSON
       PPERS -> PERSONAL PRONOUN TO PERSON
       PoPERS -> POSSESSIVE PRONOUN TO PERSON
       APPERS -> APPOSITIVE TO PERSON
       LOC -> LOCATION
       NA -> NOT AVAILABLE
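A rough sketch of what I mean by training an HMM tagger directly on text annotated this way: count tag transitions and word emissions, then decode with Viterbi. This is only an illustration under my own assumptions (every whitespace-separated token carries exactly one `/TAG`, add-k smoothing with k = 0.1, two tiny training sentences cut from the example above); the function names are made up for the sketch. Python 3 syntax is assumed.

```python
import math
from collections import Counter, defaultdict


def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


def train_hmm(sentences):
    """Count tag transitions and word emissions from tagged sentences."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sentence in sentences:
        prev = "<s>"              # sentence-start pseudo tag
        for word, tag in sentence:
            word = word.lower()
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes, k=0.1):
    """Add-k smoothed log probability of key under counter."""
    return math.log((counter[key] + k) / (sum(counter.values()) + k * n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most likely tag sequence for a non-empty list of words."""
    tags = list(emit)
    n_words = len(vocab) + 1      # +1 reserves probability mass for unseen words
    best = [{}]                   # best[t][tag] = (log score, previous tag)
    for tag in tags:
        score = (log_prob(trans["<s>"], tag, len(tags))
                 + log_prob(emit[tag], words[0].lower(), n_words))
        best[0][tag] = (score, None)
    for t in range(1, len(words)):
        best.append({})
        for tag in tags:
            e = log_prob(emit[tag], words[t].lower(), n_words)
            best[t][tag] = max(
                (best[t - 1][p][0] + log_prob(trans[p], tag, len(tags)) + e, p)
                for p in tags
            )
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda name: best[-1][name][0])
    path = [tag]
    for t in range(len(words) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return path[::-1]


train = [
    parse_tagged("Obama/PERS is/NA delivering/NA a/NA lecture/NA in/NA Washington/LOC"),
    parse_tagged("he/PPERS knew/NA it/NA was/NA his/PoPERS speech/NA"),
]
trans, emit, vocab = train_hmm(train)
predicted = viterbi("he knew it was his lecture".split(), trans, emit, vocab)
```

Note that a plain HMM like this only labels "he" as a pronoun referring to some person; it does not say which PERS mention it refers to, so the actual linking of mentions would still need a separate step.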

Would I be wrong? I ran an experiment with around 10,000 words, and early results seem encouraging. With support from one of my colleagues, I am trying to insert some semantic information into the tagset, like PERSUSPOL, LOCCITUS, PoPERSM, etc. (for PERSON OF US IN POLITICS, LOCATION CITY US, POSSESSIVE PERSON MALE), to incorporate entity disambiguation in one go. My feeling is that relationship extraction would be much better now. Please see this new thought too. I also got some good results with a Naive Bayes classifier, where sentences having predominantly one set of keywords are marked as one class.
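The Naive Bayes idea can be sketched roughly as below: a multinomial model with add-one smoothing over the words of each sentence. The class labels (`POLITICS`, `PRODUCT`), the toy sentences, and the function names are invented for illustration and are not my real data. Python 3 syntax is assumed.

```python
import math
from collections import Counter, defaultdict


def train_nb(labelled_sentences):
    """Collect per-class word counts and class frequencies."""
    word_counts = defaultdict(Counter)  # word_counts[label][word]
    class_counts = Counter()            # number of sentences per label
    vocab = set()
    for sentence, label in labelled_sentences:
        words = sentence.lower().split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab


def classify(sentence, word_counts, class_counts, vocab):
    """Return the class with the highest add-one smoothed log posterior."""
    total_sentences = sum(class_counts.values())
    scores = {}
    for label, n_sentences in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_sentences / total_sentences)
        for word in sentence.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training data (hypothetical).
data = [
    ("the president delivered a speech in the senate", "POLITICS"),
    ("the senator met the president before the vote", "POLITICS"),
    ("the company launched a new phone", "PRODUCT"),
    ("the new laptop ships with a faster processor", "PRODUCT"),
]
model = train_nb(data)
label = classify("the president gave a speech", *model)
```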

If anyone can suggest a different approach, please feel free to do so.

I use Python 2.x on MS Windows and try to use libraries like NLTK, scikit-learn, Gensim, pandas, NumPy, SciPy, etc.

Thanks in advance.

Coeus2016 asked Apr 10 '16 20:04


1 Answer

It seems that you are going down three different paths that are quite distinct, and each could be a standalone PhD. There is a lot of literature on each of them. My first advice: focus on the main task and outsource the rest. Even if you are developing this for a less widely studied language, you can build on the work of others.

Named Entity Recognition

Stanford NLP has really gone far in this, especially for English. It recognizes named entities really well, is widely used, and has a nice community.

Other solutions may exist, such as OpenNLP for Python.

Some have tried to extend it to unusual fine-grained types, but you need much more training data to cover the cases, and the decision becomes much harder.

Edit: Stanford NER is available in Python through NLTK.

Named Entity Resolution/Linking/Disambiguation

This is concerned with linking the name to an entry in some knowledge base, and solves the problem of whether Oxford means Oxford University or the city of Oxford.

AIDA: one of the state-of-the-art systems for this. It uses different kinds of context information as well as coherence information. It also supports several languages and has a good benchmark.

Babelfy: offers an interesting API that does NER and NED for entities and concepts. It also supports many languages, but it never worked very well.

Others include TagMe, wikifi, etc.
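To make the linking idea concrete, here is a toy sketch that picks the knowledge-base candidate whose description shares the most words with the mention's context. It is only an illustration of context-based disambiguation in general, not how AIDA or Babelfy work internally; the two-entry `KNOWLEDGE_BASE`, its descriptions, and the function names are all made up. Python 3 syntax is assumed.

```python
import re

# Hypothetical two-entry knowledge base; a real system would draw its
# candidates and descriptions from a large resource such as Wikipedia.
KNOWLEDGE_BASE = {
    "Oxford": {
        "University_of_Oxford": "university college students research professor organization",
        "Oxford_(city)": "city england river thames population county location",
    }
}


def tokens(text):
    """Lower-cased set of alphabetic tokens in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def link(mention, context):
    """Return the candidate whose description overlaps most with the context."""
    candidates = KNOWLEDGE_BASE.get(mention, {})
    if not candidates:
        return None  # mention is not covered by the knowledge base
    context_words = tokens(context)
    return max(
        candidates,
        key=lambda name: len(tokens(candidates[name]) & context_words),
    )


as_org = link("Oxford", "She is a professor at Oxford and supervises research students")
as_city = link("Oxford", "Oxford is a city on the river Thames in England")
```

Real systems add a prior for how often a name refers to each candidate and a coherence term across all mentions in the document, which is where most of the difficulty lies.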

Coreference Resolution

Stanford CoreNLP also has some good work in that direction. I can also recommend this work, where they combined coreference resolution with NED.

Mohamed Gad-Elrab answered Oct 12 '22 11:10
