
R RecordLinkage Identity

Tags: r, linkage, record

I am working with the RecordLinkage library in R. I have a data frame with the columns id, name, phone and mail.

My code looks like this:

ids = data$id
pairs = compare.dedup(data, identity=ids, blockfld=list(2,3,4))

The problem is that my ids do not show up in the result output. For example, if I had this data:

id   Name     Phone    Mail
233  Nathali  2222     [email protected]
435  Nathali  2222 
553  Jean     3444     [email protected]

In my result output I get something like:

id1 id2
1   2

instead of:

id1 id2
233 435 

I want to know if there is a way to keep the ids instead of the row index, or whether someone could explain the identity parameter to me.

Thanks

Náthali asked Mar 16 '16 17:03


1 Answer

The identity vector tells the getPairs method which of the input records belong to the same entity. It actually holds the information that you usually want to gain from record linkage: normally you have a couple of records and do not know in advance which of them belong together. However, when you use a training set to calibrate a method, or you want to evaluate the accuracy of record linkage methods (the package was mainly written for this purpose), you start with an already deduplicated or linked data set.

In your example, the first two rows (ids 233, 435) obviously mean the same person and the third row a different one. A meaningful identity vector would therefore be:

c(1,1,2)

But it could also be:

c(42,42,128)

Just make sure that the identity vector has identical values exactly at those positions where the corresponding table rows hold matching records (vector index = row index).
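To illustrate why the concrete values do not matter, here is a small base R sketch. The helper same_entity is a hypothetical name introduced only for this example; it is not part of RecordLinkage. It checks which positions carry equal labels and shows that both vectors above describe the same grouping.

```r
# Two labelings of the same three records (values taken from the text above).
identity_a <- c(1, 1, 2)
identity_b <- c(42, 42, 128)

# Hypothetical helper (not a RecordLinkage function):
# entry [i, j] is TRUE when rows i and j are labeled as the same entity.
same_entity <- function(identity) outer(identity, identity, "==")

# Both vectors induce the same match structure, so either one could be used.
labels_equivalent <- identical(same_entity(identity_a), same_entity(identity_b))
print(labels_equivalent)
```

Only the pattern of equal and unequal positions is compared here, which is all the identity vector is meant to encode.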

About your question on how to display the ids in the result: You can get the full record pairs, including all data fields, with (see the documentation for more details):

getPairs(pairs)

There might be better ways to get hold of the original ids, depending on how you further process the record pairs (e.g. running a classification algorithm). Extend your example if you need more advice on this.
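One possible way to get the original ids, sketched under a few assumptions: the id1 and id2 columns of the pairs table (pairs$pairs) hold row indices into the input data, so they can be used to index the id column directly. The mail addresses below are placeholders, the id column is dropped before comparison so that it is not treated as a matching field, and blockfld = list(1, 2) (same name or same phone) is just one choice for this toy data.

```r
library(RecordLinkage)

# Example data from the question; mail values are placeholders (assumption).
data <- data.frame(
  id    = c(233, 435, 553),
  name  = c("Nathali", "Nathali", "Jean"),
  phone = c("2222", "2222", "3444"),
  mail  = c("nathali@example.com", NA, "jean@example.com"),
  stringsAsFactors = FALSE
)

# Compare without the id column. The identity vector marks rows 1 and 2 as
# the same person. Blocking: pairs must agree on name (col 1) or phone (col 2).
pairs <- compare.dedup(data[, -1], identity = c(1, 1, 2), blockfld = list(1, 2))

# id1 / id2 are row indices, so look up the original ids by position.
result <- data.frame(
  id1      = data$id[pairs$pairs$id1],
  id2      = data$id[pairs$pairs$id2],
  is_match = pairs$pairs$is_match
)
print(result)
```

The lookup relies on the row order of data staying unchanged between the call to compare.dedup and the indexing step.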

p.s.: I am one of the package authors. I have only very recently become aware that people ask questions about the package on Stack Overflow, so please excuse that a couple of questions have been around unanswered for a long time. I will look for a way to get notified on new questions posted here, but I would also like to mention that people can contact us directly via one of the email addresses listed in the package information.

Andreas Borg answered Oct 13 '22 21:10