 

How does one aggregate and summarize data quickly?

Tags: r, data.table, plyr

I have a dataset whose headers look like so:

PID Time Site Rep Count

I want to sum Count by Rep for each PID x Time x Site combination.

Then, on the resulting data.frame, I want the mean value of Count for each PID x Time x Site combination.

Current function is as follows:

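dummy <- function(data) {
  # Sum Count within each PID x Time x Site x Rep group, ignoring NAs
  A <- aggregate(Count ~ PID + Time + Site + Rep, data = data,
                 FUN = function(x) sum(na.omit(x)))
  # Average the per-Rep sums within each PID x Time x Site group
  B <- aggregate(Count ~ PID + Time + Site, data = A, FUN = mean)
  return(B)
}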

This is painfully slow (the original data.frame is 510000 x 20). Is there a way to speed this up with plyr?

Maiasaura asked Oct 11 '11 07:10




1 Answer

You should look at the data.table package for faster aggregation on large data frames. For your problem, with data_tab standing in for your data frame, the solution would look like:

library(data.table)
data_t = data.table(data_tab)
# Column names are case-sensitive, so use Count as in the question's header
ans = data_t[, list(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
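Note that the snippet above sums and averages Count directly within each PID x Time x Site group. If you want the two-step result described in the question (sum per Rep first, then the mean of those sums), a minimal sketch might look like the following. The toy data and the names dat, rep_totals, and result are made up for illustration; only the column names come from the question. One caveat: sum(..., na.rm = TRUE) returns 0 for a Rep whose counts are all NA, whereas the aggregate() formula interface drops such rows by default, so the two can differ in that edge case.

```r
library(data.table)

# Hypothetical toy data using the column names from the question
dat <- data.table(
  PID   = c(1, 1, 1, 1, 2, 2),
  Time  = c(1, 1, 1, 1, 1, 1),
  Site  = c("a", "a", "a", "a", "b", "b"),
  Rep   = c(1, 1, 2, 2, 1, 2),
  Count = c(1, 2, 3, NA, 5, 7)
)

# Step 1: total Count within each PID x Time x Site x Rep group, ignoring NAs
rep_totals <- dat[, .(Count = sum(Count, na.rm = TRUE)),
                  by = .(PID, Time, Site, Rep)]

# Step 2: average the per-Rep totals within each PID x Time x Site group
result <- rep_totals[, .(Count = mean(Count)), by = .(PID, Time, Site)]

print(result)
```

Both steps are grouped operations inside data.table, so no intermediate data.frame copies are made by aggregate(). On a table the size you describe this is usually much faster, though the actual gain depends on the number of groups.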
Ramnath answered Dec 18 '22 08:12