I have a dataset whose headers look like this:
PID Time Site Rep Count
I want to sum Count by Rep for each PID x Time x Site combination. Then, on the resulting data.frame, I want to get the mean value of Count for each PID x Time x Site combination.
Current function is as follows:
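dummy <- function(data) {
  # Step 1: sum Count within each PID x Time x Site x Rep group
  A <- aggregate(Count ~ PID + Time + Site + Rep, data = data, function(x) sum(na.omit(x)))
  # Step 2: average the per-Rep sums within each PID x Time x Site group
  B <- aggregate(Count ~ PID + Time + Site, data = A, mean)
  return(B)
}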
This is painfully slow (the original data.frame is 510000 x 20). Is there a way to speed this up with plyr?
You should look at the data.table package for faster aggregation operations on large data frames. For your problem, the solution would look like:
library(data.table)
# data_tab is the original data.frame
data_t = data.table(data_tab)
# sum and mean of Count within each PID x Time x Site group
ans = data_t[, list(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
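Note that the one-liner above takes the sum and the mean of the raw Count values per PID x Time x Site group, which is not quite the same as the two-step result in the question (sum per Rep first, then the mean of those sums). Here is a minimal sketch of the two-step version in data.table. The function name `two_step` and the toy data are made up for illustration, and it assumes the grouping columns contain no NA values; rows with NA in Count are dropped up front, which is roughly what the formula interface of aggregate does by default.

```r
library(data.table)

# Toy data with the same column names as the question (values are made up).
data_tab <- data.frame(
  PID   = c(1, 1, 1, 1, 2, 2),
  Time  = c(1, 1, 1, 1, 1, 1),
  Site  = c("a", "a", "a", "a", "b", "b"),
  Rep   = c(1, 1, 2, 2, 1, 2),
  Count = c(2, 3, NA, 5, 4, 6)
)

# Hypothetical helper: sum Count per Rep, then average those sums.
two_step <- function(data) {
  dt <- as.data.table(data)
  # Step 1: drop rows with NA Count, then total Count per PID x Time x Site x Rep.
  per_rep <- dt[!is.na(Count), .(Count = sum(Count)), by = .(PID, Time, Site, Rep)]
  # Step 2: mean of the per-Rep totals for each PID x Time x Site group.
  per_rep[, .(Count = mean(Count)), by = .(PID, Time, Site)]
}

result <- two_step(data_tab)
print(result)
```

On the real data you would call `two_step(data)` in place of `dummy(data)`; the result is a data.table, which can be converted back with `as.data.frame()` if needed.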