I have two dataframes; one is 48 rows long and looks like this:
name = Z31
Est.Date Site Cultivar Planting
1 24/07/2011 Birchip Axe 1
2 08/08/2011 Birchip Bolac 1
3 24/07/2011 Birchip Derrimut 1
4 12/08/2011 Birchip Eaglehawk 1
5 29/07/2011 Birchip Gregory 1
6 29/07/2011 Birchip Lincoln 1
7 23/07/2011 Birchip Mace 1
8 29/07/2011 Birchip Scout 1
9 17/09/2011 Birchip Axe 2
10 19/09/2011 Birchip Bolac 2
The other is > 23000 rows and contains output from a simulator. It looks like this:
name = pred
Date maxt mint Cultivar Site Planting tt cum_tt
1 5/05/2011 18 6.5 Axe Birchip 1 12.25 0
2 6/05/2011 17.5 2.5 Axe Birchip 1 10 0
3 7/05/2011 18 2.5 Axe Birchip 1 10.25 0
4 8/05/2011 19.5 2 Axe Birchip 1 10.75 0
5 9/05/2011 17 4.5 Axe Birchip 1 10.75 0
6 10/05/2011 15.5 -0.5 Axe Birchip 1 7.5 0
7 11/05/2011 14 5.5 Axe Birchip 1 9.75 0
8 12/05/2011 19 8 Axe Birchip 1 13.5 0
9 13/05/2011 18.5 7.5 Axe Birchip 1 13 0
10 14/05/2011 16 3.5 Axe Birchip 1 9.75 0
What I want to do is have the cum_tt column start to add the tt column of the current row to the cum_tt of the previous row (cumulative addition) ONLY if the date in the pred DF is equal to or greater than the Z31 est.Date. I have written the following for loop:
for (i in 1:nrow(Z31)){
for (j in 1:nrow(pred)){
if (Z31[i,]$Site == pred[j,]$Site & Z31[i,]$Cultivar == pred[j,]$Cultivar & Z31[i,]$Planting == pred[j,]$Planting &
pred[j,]$Date >= Z31[i,]$Est.Date)
{
pred[j,]$cum_tt <- pred[j,]$tt + pred[j-1,]$cum_tt
}
}
}
This works but it is so slow it will take about an hour to run the entire set. I know loops are not R's strong point so can anyone assist me with vectorising this operation?
Thanks in advance.
UPDATE
Here's the output from dput(Z31):
structure(list(Est.Date = structure(c(15179, 15194, 15179, 15198,
15184, 15184, 15178, 15184, 15234, 15236, 15230, 15238, 15229,
15236, 15229, 15231, 15155, 15170, 15160, 15168, 15165, 15159,
15170, 15170, 15191, 15205, 15198, 15203, 15202, 15195, 15203,
15206, 15193, 15193, 15195, 15200, 15193, 15205, 15200, 15205,
15226, 15245, 15231, 15259, 15241, 15241, 15241, 15241), class = "Date"),
Site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Birchip", "Gatton",
"Tarlee"), class = "factor"), Cultivar = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L), .Label = c("Axe", "Bolac", "Derrimut", "Eaglehawk",
"Gregory", "Lincoln", "Mace", "Scout"), class = "factor"),
Planting = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L)), .Names = c("Est.Date", "Site",
"Cultivar", "Planting"), row.names = c(NA, -48L), class = "data.frame")
Here's pred. Note there are some extra colums here. I've just included the relevant ones above for ease of reading.
structure(list(Date = structure(c(15099, 15100, 15101, 15102,
15103, 15104, 15105, 15106, 15107, 15108, 15109, 15110, 15111,
15112, 15113, 15114, 15115, 15116, 15117, 15118), class = "Date"),
flowering_das = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Zadok = c(9, 9,
9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 11, 11.032, 11.085,
11.157), stagename = structure(c(8L, 8L, 8L, 8L, 8L, 8L,
8L, 8L, 9L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 1L, 3L, 3L, 3L), .Label = c("emergence",
"end_grain_fill", "end_of_juvenil", "floral_initiat", "flowering",
"germination", "maturity", "out", "sowing", "start_grain_fi"
), class = "factor"), node_no = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 2, 2.032, 2.085, 2.157), maxt = c(18,
17.5, 18, 19.5, 17, 15.5, 14, 19, 18.5, 16, 16, 15, 16.5,
16.5, 20.5, 23, 25.5, 16.5, 16.5, 15), mint = c(6.5, 2.5,
2.5, 2, 4.5, -0.5, 5.5, 8, 7.5, 3.5, 6, 1, 5.5, 2, 7, 7,
9, 13.5, 11.5, 8.5), Cultivar = c("Axe", "Axe", "Axe", "Axe",
"Axe", "Axe", "Axe", "Axe", "Axe", "Axe", "Axe", "Axe", "Axe",
"Axe", "Axe", "Axe", "Axe", "Axe", "Axe", "Axe"), Site = c("Birchip",
"Birchip", "Birchip", "Birchip", "Birchip", "Birchip", "Birchip",
"Birchip", "Birchip", "Birchip", "Birchip", "Birchip", "Birchip",
"Birchip", "Birchip", "Birchip", "Birchip", "Birchip", "Birchip",
"Birchip"), Planting = c("1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1"), `NA` = c("Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out",
"Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out",
"Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out",
"Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out",
"Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out",
"Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out",
"Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out", "Birchip TOS1 Axe.out"
), tt = c(12.25, 10, 10.25, 10.75, 10.75, 7.5, 9.75, 13.5,
13, 9.75, 11, 8, 11, 9.25, 13.75, 15, 17.25, 15, 14, 11.75
), cum_tt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0)), .Names = c("Date", "flowering_das", "Zadok",
"stagename", "node_no", "maxt", "mint", "Cultivar", "Site", "Planting",
NA, "tt", "cum_tt"), row.names = c(NA, 20L), class = "data.frame")
UPDATE
Thanks everyone for your help. I'm still new to the vector way of doing things and I wasn't able to implement some of the more complex solutions in time. I've got some timing below for the way Subs suggested. It's fast enough now to do what I need. These number are in seconds for one iteration of Z over P.
my way: 59.77
Subs: 14.62
Subs using numeric date: 11.12
Sure that we can do this in a few seconds... my first answer on here so go gentle!
## first make sure we have dates in a suitable format for comparison
## by using strptime, creating the columns estdate_tidy and date_tidy
## in Z31 and pred respectively
Z31$estdate_tidy = strptime(as.character(Z31$Est.Date), "%d/%m/%Y")
pred$date_tidy = strptime(as.character(pred$Date), "%d/%m/%Y")
## now map the estdate_tidy column over to pred using the match command -
## Z31_m and pred_m are dummy variables that hopefully make this clear
Z31_m = paste(Z31$Site, Z31$Cultivar, Z31$Planting)
pred_m = paste(pred$Site, pred$Cultivar, pred$Planting)
pred$estdate_tidy = Z31$estdate_tidy[match(pred_m, Z31_m)]
## then define a ttfilter column that copies tt, except for being 0 when
## estdate_tidy is after date_tidy (think this is what you described)
pred$ttfilter = ifelse(pred$date_tidy >= pred$estdate_tidy, pred$tt, 0)
## finally use cumsum function to sum this new column up (looks like you
## wanted the answer in cum_tt so I've put it there)
pred$cum_tt = cumsum(pred$ttfilter)
Hope this helps :)
UPDATE (7 June):
The vectorised code to tackle the new specification - i.e. that the cumsum should be done separately for each set of conditions (Site/Cultivar/Planting) - is shown below:
Z31$Lookup=with(Z31,paste(Site,Cultivar,Planting,sep="~"))
Z31$LookupNum=match(Z31$Lookup,unique(Z31$Lookup))
pred$Lookup=with(pred,paste(Site,Cultivar,Planting,sep="~"))
pred$LookupNum=match(pred$Lookup,unique(pred$Lookup))
pred$Est.Date = Z31$Est.Date[match(pred$Lookup,Z31$Lookup)]
pred$ttValid = (pred$Date>=pred$Est.Date)
pred$ttFiltered = ifelse(pred$ttValid, pred$tt, 0)
### now fill in cumsum of ttFiltered separately for each LookupNum
pred$cum_tt_Z31 = as.vector(unlist(tapply(pred$ttFiltered,
pred$LookupNum,cumsum)))
Runtime is 0.16 seconds on my machine, and the final pred$cum_tt_Z31
column matches exactly with the answer from the non-vectorised code :)
For completeness, it's worth noting that the final complicated tapply line above can be replaced by the following simpler approach with a short loop over the 48 possible cases:
pred$cum_tt_Z31 = rep(NA, nrow(pred))
for (lookup in unique(pred$Lookup)) {
subs = which(pred$Lookup==lookup)
pred$cum_tt_Z31[subs] = cumsum(pred$ttFiltered[subs])
}
The runtime only increases slightly to 0.25 seconds or so, because the loop here is very small and the work done inside the loop is vectorised.
Think we've cracked it! :)
Some quick observations on vectorisation (8 June):
The process of vectorising the steps of the process brought runtime down from close to an hour to 0.16 seconds in total. Even allowing for different machine speeds, this is a speedup by at least a factor of 10,000, which dwarfs the factors of 2-5 one might get from making minor tweaks but still retaining a loop structure.
The first key observation to make: in the solution, every line creates - without looping - an entire new vector with the same length as the columns in Z31 or pred. For neatness, I often find it useful to create these new vectors as new data frame columns, but obviously this is not strictly necessary.
Second observation: the required Est.Date column gets transferred correctly from Z31 to pred using a "paste'n'match" strategy. There are alternative approaches to this sort of task (e.g. using merge), but I take this route as it is completely failsafe and guarantees to preserve the order in pred (which is crucial). Essentially the paste operation just lets you match up several fields at once, because if the pasted strings match then all their constituent parts match. I use ~ as a separator (provided I know that symbol won't appear in any of the fields) to avoid the paste operation causing any ambiguity. If you use a space separator, then pasting together something like ("A B", "C", "D") would give the same result as pasting ("A", "B C", "D") - and we want to avoid any headaches!
Third observation: it's easy to vectorise logical operations such as seeing if one vector exceeds another (see pred$ttValid), or choosing one value or another based on the value of a vector (see pred$ttFiltered). In the current situation, these could be combined into one line, but I broke things down a little more as a demonstration.
Fourth observation: the final line that creates pred$cum_tt_Z31 essentially just does a cumsum operation across the rows corresponding to each separate value of pred$LookupNum, using tapply, which allows you to apply the same function over different groups of rows (here, we're grouping by pred$LookupNum). The definition of pred$LookupNum helps enormously here - it's a numeric index with a block of 1s, followed by a block of 2s, and so on. This means the final list of cumsum vectors that comes out of the tapply can simply be unlisted and put into a vector and is automatically in the right order. If you do a tapply and split by groups that aren't ordered like this, you normally need a couple of additional lines of code to match things back up again correctly (though it isn't tricky).
Final observation: if the final tapply is too scary, it's worth highlighting that a quick loop over a small number of cases (48 say) isn't always disastrous if the work within the loop is nicely vectorised. The "alternative method" at the end of the UPDATE section shows that the cumsum-on-groups step can also be achieved by pre-preparing an answer column (initially all NAs) and then going through the 48 subsets one by one and filling in the each block with the appropriate cumsum. As noted in the text, though, this single step is roughly half the speed as the clever tapply approach, and if more subsets were needed this would be considerably slower.
If anyone has any follow up questions on this kind of task, please don't hesitate to give me a shout.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With