
R Data.table for computing summary stats across multiple columns

My question is similar to "R: data.table : searching on multiple columns AND setting data type", which was never fully answered. I have a pairwise table that looks conceptually like the one below. It is the result of converting a very large distance matrix into a data.table (> 100,000,000 rows), so the comparison a,b is the same as b,a, and a given id may appear in either column V1 or V2. I want to compute simple summary statistics per id using data.table's querying style, but I haven't figured out how to group on a key that can sit in either column. Is this possible?

I've tried grouping by each column in turn, but that only uses the rows where the id appears in that one column. I also tried by=list(V1,V2), but that groups on each unique (V1, V2) pair (understandably). I hoped for something like by=key1|key2, but no such luck.


> set.seed(123)
> 
> #create pairwise data
> a<-data.table(t(combn(3,2)))
> #create column that is equal both ways, 1*2 == 2*1
> dat<-a[,data:=V1*V2]
> dat
   V1 V2 data
1:  1  2    2
2:  1  3    3
3:  2  3    6
#The id ==2 is the problem here, the mean should be 4 ((2+6)/2)

> #set keys
> setkey(dat,V1,V2)
> 
> #One way data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1]
> dat
   V1 V2 data MEAN VAR
1:  1  2    2  2.5 0.5
2:  1  3    3  2.5 0.5
3:  2  3    6  6.0  NA

> #The other way
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V2]
> dat
   V1 V2 data MEAN VAR
1:  1  2    2  2.0  NA
2:  1  3    3  4.5 4.5
3:  2  3    6  4.5 4.5
> 
> #The intersect just produces the original data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=list(V1,V2)]
> dat
   V1 V2 data MEAN VAR
1:  1  2    2    2  NA
2:  1  3    3    3  NA
3:  2  3    6    6  NA
> 
> #Meaningless but hopeful attempt.
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1|V2]
> dat
   V1 V2 data     MEAN      VAR
1:  1  2    2 3.666667 4.333333
2:  1  3    3 3.666667 4.333333
3:  2  3    6 3.666667 4.333333
#The goal is to create a table that looks like this (using mean as an example)
ID MEAN
 1  2.5
 2  4.0
 3  4.5

My default idea would be to loop through a dat[V1==x|V2==x] statement for each id, but I don't think that harnesses the full power of data.table. I'd like to return a single column of ids with the mean and the variance as summary columns.
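For reference, the loop described above might look like the following. This is only a sketch on the toy data from the question; the names ids and loop_res are introduced here for illustration and are not from the original post.

```r
library(data.table)

# Rebuild the toy pairwise table from the question
dat <- data.table(t(combn(3, 2)))
dat[, data := V1 * V2]

# Every id that occurs in either column
ids <- sort(unique(c(dat$V1, dat$V2)))

# One subset per id: all rows where the id appears in V1 or V2,
# reduced to a single summary row
loop_res <- rbindlist(lapply(ids, function(x) {
  dat[V1 == x | V2 == x, list(ID = x, MEAN = mean(data), VAR = var(data))]
}))
print(loop_res)
```

This gives the desired means (2.5, 4.0, 4.5), but it scans the whole table once per id, which is likely to be slow on a table with more than 100,000,000 rows.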

Thank you!

bw4sz asked May 07 '14 14:05



1 Answer

It'll be easiest to rearrange your data a little to achieve what you want: stack V1 and V2 into a single id column, then group by it. (Below I rely on data being recycled so that I don't have to type c(data, data) in the first part.)

dat[, list(c(V1, V2), data)][, list(MEAN = mean(data)), by = V1]
#   V1 MEAN
#1:  1  2.5
#2:  2  4.0
#3:  3  4.5
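The same idea can be written without relying on recycling, and extended to return the variance as well. This is a sketch, not part of the original answer: the names long, res, and ID are introduced here for illustration, and data is repeated explicitly with rep() on the assumption that some data.table versions may be stricter about recycling columns of unequal length.

```r
library(data.table)

# Rebuild the toy pairwise table from the question
dat <- data.table(t(combn(3, 2)))
dat[, data := V1 * V2]

# Stack both id columns into one, so every pair contributes
# one row for each of its two ids; repeat `data` to match
long <- dat[, list(ID = c(V1, V2), data = rep(data, 2))]

# Summarise once per id over all pairs that the id takes part in
res <- long[, list(MEAN = mean(data), VAR = var(data)), by = ID][order(ID)]
print(res)
```

On the toy data this gives means of 2.5, 4.0, and 4.5 and variances of 0.5, 8.0, and 4.5 for ids 1, 2, and 3. The stacked table has twice as many rows as the original, so memory use is worth checking on a very large table.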
eddi answered Oct 28 '22 12:10