Situation:
I have a CSV file A with two columns, Customer ID and Entry date. A contains about 1,500,000 observations. I have another CSV file B with a single column, Customer ID. B is a smaller subset of A.
Goal:
Since the entry dates are missing in table B, I would like to get that information from table A and write it all into a new table C.
Current Progress:
I've created 10 subsets S1,...,S10 from A and taken the maximum Customer ID from each subset. In a for loop, I run through all entries of B and check whether each one lies within one of the subsets (by comparing its Customer ID against the subset's maximum Customer ID). Once I've found the subset that is supposed to contain the Customer ID, I use the which function to look for the element of B in A (see the sketch below). This is awfully slow. Isn't there another, quicker way?
And which would be the best R objects to read the CSV files into? Currently, A is a data frame and B is a large integer vector.
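A simplified sketch of this approach (it assumes A is a data frame sorted by Customer ID, so the subset maxima increase, and B is an integer vector of IDs):
# split A into 10 row-wise chunks and record each chunk's maximum Customer ID
subsets <- split(A, cut(seq_len(nrow(A)), 10, labels = FALSE))
max_ids <- sapply(subsets, function(s) max(s[["Customer ID"]]))
C <- data.frame()
for (id in B) {
  k <- which(max_ids >= id)[1]                   # first subset whose max ID covers this ID
  s <- subsets[[k]]
  row <- s[which(s[["Customer ID"]] == id), ]    # look up the ID with which()
  C <- rbind(C, row)                             # growing C row by row is one reason this is slow
}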
I would use data.table. It is trivially easy to do this (see the last command!) and very fast, using what is known as a keyed join. Basically, you look up entries from b in a using their common key (in your case "Customer ID"). As an example:
require(data.table)
a <- data.table(id = 1:10, date = as.Date(1:10, origin = "2016-01-28"))  # as.Date() needs an origin for numeric input
setkey(a, id)   # sort a by id and mark it as the key
b <- data.table(id = 4:6)
setkey(b, id)
a[b]            # keyed join: look up b's ids in a
#    id       date
# 1:  4 2016-02-01
# 2:  5 2016-02-02
# 3:  6 2016-02-03
With your given example, you would type this to read in your data and do a keyed join to get the entry date for each person in table b:
a <- fread("A.csv")
setkeyv(a, "Customer ID")   # setkeyv() takes the column name as a string, which is handy for names with spaces
b <- fread("B.csv")
setkeyv(b, "Customer ID")
c <- a[b]                   # keyed join: adds the entry date from a to every id in b
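If you then want the result written out as the new table C from your question, data.table's fwrite can export it (the output file name below is just an assumption):
fwrite(c, "C.csv")   # write the joined table to disk; "C.csv" is an assumed file name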