Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to efficiently match two data tables in R

Tags:

r

match

Situation: I have a CSV file A with two columns Customer ID and Entry date. A contains about 1.500.000 observations.

I have another CSV file B with a single column Customer ID. B is a smaller subset of A.

Goal: Since the info about their entry date is missing in table B, I would like to get that info from table A and write it all into a new table C.

Current Progress: I've created 10 subsets S1,...,S10 from A and from each subset the maximum customer ID. In a for loop, I run through all entries of B and check if B lies within one of the subsets (via customer ID and max c ID of the subset). Once I've found a subset in which I am supposed to find the customer ID, I use the function which to look for the element of B in A.

This is awfully slow.

Isn't there another quicker way?

And which would be the best objects in R to use the CSV file as, currently, A is a data Frame, and B is a large integer.

like image 832
blueeyes0710 Avatar asked Jan 28 '16 15:01

blueeyes0710


1 Answers

I would use data.table. It is trivially easy to do this (see last command!), and very fast using what is known as a keyed join. Basically you look up entries from b in a using their common key (in your case "Customer ID"). As an example:

require(data.table)
a <- data.table(id=1:10,date=as.Date(1:10))
setkey(a,id)
b <- data.table(id=4:6)
setkey(b,id)

a[b]
#   id       date
#1:  4 2016-02-01
#2:  5 2016-02-02
#3:  6 2016-02-03

With your given example you would type this, to read in your data and do a keyed join to get the entry date for each person in table b:

a <- fread( "A.csv" )
setkey(a, "Customer ID")
b <- fread( "B.csv" )
setkey(a, "Customer ID")

c <- a[b]
like image 107
Simon O'Hanlon Avatar answered Sep 27 '22 18:09

Simon O'Hanlon