I have a similar question to: R: data.table : searching on multiple columns AND setting data type , but this question did not get fully answered. I have a pairwise table that looks conceptually like the one below. The table is the result of converting a very large distance matrix into a data.table (> 100,000,000 rows), such that the comparison a,b is the same as b,a. However a and b may appear in either column V1 or V2. I want to compute simple summary statistics using data.table's querying style, but i haven't quite figured out how to select keys in either column. Is this possible?
I've tried setting keys in either direction, but this returns just the data for that column. I also tried using list(), but that returns the intersection (understandably), i hoped for a by=key1|key2, but no such luck.
> set.seed(123)
>
> #create pairwise data
> a<-data.table(t(combn(3,2)))
> #create column that is equal both ways, 1*2 == 2*1
> dat<-a[,data:=V1*V2]
> dat
V1 V2 data
1: 1 2 2
2: 1 3 3
3: 2 3 6
#The id ==2 is the problem here, the mean should be 4 ((2+6)/2)
> #set keys
> setkey(dat,V1,V2)
>
> #One way data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2.5 0.5
2: 1 3 3 2.5 0.5
3: 2 3 6 6.0 NA
> #The other way
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V2]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2.0 NA
2: 1 3 3 4.5 4.5
3: 2 3 6 4.5 4.5
>
> #The intersect just produces the original data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=list(V1,V2)]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 2 NA
2: 1 3 3 3 NA
3: 2 3 6 6 NA
>
> #Meaningless but hopefull attempt.
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1|V2]
> dat
V1 V2 data MEAN VAR
1: 1 2 2 3.666667 4.333333
2: 1 3 3 3.666667 4.333333
3: 2 3 6 3.666667 4.333333
#The goal is to create a table would look like this (using mean as an example)
ID MEAN
1 2.5
2 4.0
3 4.5
My default ideas would be too loop through a dat[V1==x|V2==x] statement, but i don't think i'm harnessing the full power of data.table to return a single column of ids with mean the var as summary columns.
Thank you!
The easiest way to create summary tables in R is to use the describe() and describeBy() functions from the psych library.
To pick out single or multiple columns use the select() function. The select() function expects a dataframe as it's first input ('argument', in R language), followed by the names of the columns you want to extract with a comma between each name.
The summary table is a visualization that summarizes statistical information about data in table form. The information is based on one data table in TIBCO Spotfire. You can, at any time, choose which measures you want to see (such as mean, median, etc.), as well as the columns on which to base these measures.
It'll be easiest to rearrange your data a little to achieve what you want (I'm using recycling of data
below not to type c(data, data)
in the first part):
dat[, list(c(V1, V2), data)][, list(MEAN = mean(data)), by = V1]
# V1 MEAN
#1: 1 2.5
#2: 2 4.0
#3: 3 4.5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With