R Data.table for computing summary stats across multiple columns

Tags: r, data.table, bigdata

I have a question similar to "R: data.table : searching on multiple columns AND setting data type", but that question was never fully answered. I have a pairwise table that looks conceptually like the one below. It is the result of converting a very large distance matrix into a data.table (> 100,000,000 rows), so the comparison a,b is the same as b,a. However, a and b may each appear in either column V1 or V2. I want to compute simple summary statistics using data.table's querying style, but I haven't quite figured out how to select keys in either column. Is this possible?

I've tried setting keys in either direction, but that returns just the data for that one column. I also tried grouping with list(), but that returns the intersection (understandably). I hoped for something like by=key1|key2, but no such luck.

> set.seed(123)
> 
> #create pairwise data
> a<-data.table(t(combn(3,2)))
> #create column that is equal both ways, 1*2 == 2*1
> dat<-a[,data:=V1*V2]
> dat
   V1 V2 data
1:  1  2    2
2:  1  3    3
3:  2  3    6
#ID 2 is the problem here: its mean should be 4 ((2+6)/2)

> #set keys
> setkey(dat,V1,V2)
> 
> #One way data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1]
> dat
   V1 V2 data MEAN VAR
1:  1  2    2  2.5 0.5
2:  1  3    3  2.5 0.5
3:  2  3    6  6.0  NA

> #The other way
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V2]
> dat
   V1 V2 data MEAN VAR
1:  1  2    2  2.0  NA
2:  1  3    3  4.5 4.5
3:  2  3    6  4.5 4.5
> 
> #The intersect just produces the original data
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=list(V1,V2)]
> dat
   V1 V2 data MEAN VAR
1:  1  2    2    2  NA
2:  1  3    3    3  NA
3:  2  3    6    6  NA
> 
> #Meaningless but hopeful attempt.
> dat[,c("MEAN","VAR"):=list(mean(data),var(data)),by=V1|V2]
> dat
   V1 V2 data     MEAN      VAR
1:  1  2    2 3.666667 4.333333
2:  1  3    3 3.666667 4.333333
3:  2  3    6 3.666667 4.333333
#The goal is to create a table that would look like this (using mean as an example)
ID MEAN
 1  2.5
 2  4.0
 3  4.5

My default idea would be to loop through a dat[V1==x|V2==x] statement, but I don't think that harnesses the full power of data.table. I would like to return a single column of IDs, with the mean and the variance as summary columns.
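
For reference, the loop described above could be sketched roughly as follows. This is only an illustration on the three-row example: the names `ids`, `loop_res`, and the function argument `id` are made up here, and each iteration scans the whole table, so it is likely to be slow at 100,000,000 rows.

```r
library(data.table)

# Rebuild the small pairwise example from the question
dat <- data.table(t(combn(3, 2)))
dat[, data := V1 * V2]

# Every ID that appears in either column (illustrative name)
ids <- sort(unique(c(dat$V1, dat$V2)))

# One filtered pass over the table per ID, then bind the one-row results
loop_res <- rbindlist(lapply(ids, function(id) {
  dat[V1 == id | V2 == id, .(ID = id, MEAN = mean(data), VAR = var(data))]
}))
loop_res
```

This returns one row per ID, but the cost grows with the number of IDs times the number of rows, which is why a grouped approach is preferable.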

Thank you!


asked May 07 '14 14:05

bw4sz

1 Answer

It'll be easiest to rearrange your data a little to achieve what you want. Stack V1 and V2 into a single ID column, then group by it (below I rely on recycling of data so that I don't have to type c(data, data) in the first part):

dat[, list(c(V1, V2), data)][, list(MEAN = mean(data)), by = V1]
#   V1 MEAN
#1:  1  2.5
#2:  2  4.0
#3:  3  4.5
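
The same reshaping can also return the variance asked for in the question. The sketch below is one possible variation, not part of the original answer: it names the stacked column `ID`, writes out `c(data, data)` explicitly instead of relying on recycling, and computes both summaries in a single grouped call. The names `long` and `res` are illustrative.

```r
library(data.table)

# Rebuild the small pairwise example from the question
dat <- data.table(t(combn(3, 2)))
dat[, data := V1 * V2]

# Stack the two ID columns so each pair is counted once for each of its IDs
long <- dat[, .(ID = c(V1, V2), data = c(data, data))]

# A single grouped pass yields both summaries per ID
res <- long[, .(MEAN = mean(data), VAR = var(data)), by = ID]
res
#    ID MEAN VAR
# 1:  1  2.5 0.5
# 2:  2  4.0 8.0
# 3:  3  4.5 4.5
```

Stacking doubles the number of rows, so on a table with more than 100,000,000 rows it is worth checking that the long copy fits in memory before running it.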


answered Oct 28 '22 12:10

eddi
