Situation:
I have a CSV file A with two columns Customer ID and Entry date.
A contains about 1,500,000 observations.
I have another CSV file B with a single column Customer ID.
B is a smaller subset of A.
Goal:
Since the info about their entry date is missing in table B, I would like to get that info from table A and write it all into a new table C.
Current Progress:
I've created 10 subsets S1,...,S10 from A and recorded the maximum customer ID of each subset. In a for loop, I run through all entries of B and check which subset each customer ID of B falls into (by comparing it against the subsets' maximum customer IDs). Once I've found the subset that should contain the customer ID, I use the function which to look up that element of B in A.
This is awfully slow.
Isn't there another quicker way?
And which would be the best R objects to hold the CSV files in? Currently, A is a data.frame and B is a large integer vector.
I would use data.table. It is trivially easy (see the last command!) and very fast, using what is known as a keyed join. Basically, you look up entries from b in a via their common key (in your case "Customer ID"). As an example:
require(data.table)
a <- data.table(id=1:10, date=as.Date(1:10, origin="2016-01-28"))
setkey(a,id)
b <- data.table(id=4:6)
setkey(b,id)
a[b]
# id date
#1: 4 2016-02-01
#2: 5 2016-02-02
#3: 6 2016-02-03
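As an aside, since data.table 1.9.6 you can also do the same join without setting keys at all, by passing the join column(s) via the on= argument. A minimal sketch, using the same toy data as above:

```r
library(data.table)

# same toy tables as in the keyed example
a <- data.table(id = 1:10, date = as.Date(1:10, origin = "2016-01-28"))
b <- data.table(id = 4:6)

# ad-hoc join on "id", no setkey() needed
a[b, on = "id"]
```

This returns the same three rows as the keyed join; setting a key is still worthwhile if you will join on the same column repeatedly.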
With your data you would type the following to read in the files and do a keyed join, which gives you the entry date for each person in table b:
a <- fread( "A.csv" )
setkey(a, "Customer ID")
b <- fread( "B.csv" )
setkey(b, "Customer ID")
c <- a[b]
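Since the goal is a new table C on disk, you can write the join result straight back out with data.table's fwrite. A self-contained sketch with toy stand-ins for A.csv and B.csv (the column names "Customer ID" and "Entry date", and the output name "C.csv", are assumptions based on your description):

```r
library(data.table)

# toy stand-ins for the contents of A.csv and B.csv
a <- data.table("Customer ID" = 1:10,
                "Entry date"  = as.Date(1:10, origin = "2016-01-28"))
b <- data.table("Customer ID" = 4:6)

# setkeyv() takes column names as strings, which is handy
# when the name contains a space
setkeyv(a, "Customer ID")
setkeyv(b, "Customer ID")

c_tab <- a[b]            # keyed join: entry dates for the ids in b
fwrite(c_tab, "C.csv")   # write the result out as CSV
```

Note I named the result c_tab rather than c, to avoid masking the base function c().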