
Using the summary function inside data.table

Tags: r, data.table

I am learning data.table by working through examples, and I am stuck on a scenario of my own.

I am using the cars dataset, which I converted to a data.table to try out my commands.

> library(data.table)
> cars.dt=data.table(cars)
> cars.dt[1:5]
   speed dist
1:     4    2
2:     4   10
3:     7    4
4:     7   22
5:     8   16
.
.

I want to calculate the summary statistics of dist for each value of speed and store each statistic in its own column. Instead, the values are stacked in multiple rows.

e.g

 > cars.dt[, summary(dist), by="speed"]
      speed V1
   1:     4  2
   2:     4  4
   3:     4  6
   4:     4  6
   5:     4  8
  ---         
 110:    25 85
 111:    25 85
 112:    25 85
 113:    25 85
 114:    25 85

I was expecting the output below, but I have been unable to produce it.

    speed   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 1:     4      2       4       6       6       8      10 
 2:     7    4.0     8.5    13.0    13.0    17.5    22.0 
 3:     8     16      16      16      16      16      16 
 4:     9     10      10      10      10      10      10 
 5:    10     18      22      26      26      30      34 
 6:    11  17.00   19.75   22.50   22.50   25.25   28.00 
 7:    12   14.0    18.5    22.0    21.5    25.0    28.0 
 8:    13     26      32      34      35      37      46 
 9:    14   26.0    33.5    48.0    50.5    65.0    80.0 
10:    15  20.00   23.00   26.00   33.33   40.00   54.00 
11:    16     32      34      36      36      38      40 
12:    17  32.00   36.00   40.00   40.67   45.00   50.00 
13:    18   42.0    52.5    66.0    64.5    78.0    84.0 
14:    19     36      41      46      50      57      68 
15:    20   32.0    48.0    52.0    50.4    56.0    64.0 
16:    22     66      66      66      66      66      66 
17:    23     54      54      54      54      54      54 
18:    24  70.00   86.50   92.50   93.75   99.75  120.00 
19:    25     85      85      85      85      85      85 

I also tried the command below, but the summaries were only printed and not returned in a data.table:

> cars.dt[, print(summary(dist)), by="speed"] 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2       4       6       6       8      10 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0     8.5    13.0    13.0    17.5    22.0 
...
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  70.00   86.50   92.50   93.75   99.75  120.00 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     85      85      85      85      85      85 
 Empty data.table (0 rows) of 1 col: speed

I am unable to use functions that return multiple values together with the by clause.

If anyone has an idea of how to write this, it would be much appreciated.

Also, please let me know whether this is possible in data.table at all.

asked Sep 19 '14 07:09 by Manuel



1 Answer

Try:

 dt1 <- cars.dt[, as.list(summary(dist)), by="speed"]
 head(dt1)
 #    speed Min. 1st Qu. Median Mean 3rd Qu. Max.
 #1:     4    2    4.00    6.0  6.0    8.00   10
 #2:     7    4    8.50   13.0 13.0   17.50   22
 #3:     8   16   16.00   16.0 16.0   16.00   16
 #4:     9   10   10.00   10.0 10.0   10.00   10
 #5:    10   18   22.00   26.0 26.0   30.00   34
 #6:    11   17   19.75   22.5 22.5   25.25   28

You could also consider summaryBy from the doBy package, which gives you control over which summary functions are applied.

 library(doBy)
 dt2 <- summaryBy(.~speed, cars.dt, FUN=c(min, median, mean, max))
 head(dt2,2)
 #   speed dist.min dist.median dist.mean dist.max
 #1:     4        2           6         6       10
 #2:     7        4          13        13       22
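If you would rather not load another package, the same selection of statistics can probably be done in data.table alone by naming each one in j. This is only a sketch: the column names and the object name dt3 are my own choice, picked to resemble the summaryBy output above.

```r
library(data.table)

cars.dt <- as.data.table(cars)  # built-in cars data set: speed, dist

# Each named element of the list in j becomes one column of the result,
# computed once per group of speed. Column names here are illustrative.
dt3 <- cars.dt[, .(dist.min    = min(dist),
                   dist.median = median(dist),
                   dist.mean   = mean(dist),
                   dist.max    = max(dist)),
               by = speed]
head(dt3, 2)
```

The trade-off is that every statistic has to be spelled out, whereas as.list(summary(dist)) gives you all six at once.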

I think the difference between list and as.list here is as follows. data.table turns each element of the list returned in j into a column, so the number of list elements decides the number of columns.

Without the grouping variable:

 list(summary(cars.dt$speed))  # a list with a single element holding the whole summary
 #[[1]]
 #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 #    4.0    12.0    15.0    15.4    19.0    25.0 

 as.list(summary(cars.dt$speed))  # a list with one element per statistic
 #$Min.
 #[1] 4
 #
 #$`1st Qu.`
 #[1] 12
 #
 #$Median
 #[1] 15
 #
 #$Mean
 #[1] 15.4
 #
 #$`3rd Qu.`
 #[1] 19
 #
 #$Max.
 #[1] 25

This is the same as the difference between list(1:5) and as.list(1:5).
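To see the effect on the grouped result, a small sketch like the following compares the two shapes side by side. The object names long and wide are my own and are not part of the question or the answer.

```r
library(data.table)

cars.dt <- as.data.table(cars)  # built-in cars data set: speed, dist

# A bare vector in j is treated as a single column (V1), so the six
# statistics of each group are stacked in rows, as in the question.
long <- cars.dt[, summary(dist), by = speed]

# as.list() splits the summary into one list element per statistic,
# so each statistic gets its own column and each group a single row.
wide <- cars.dt[, as.list(summary(dist)), by = speed]

dim(long)
dim(wide)
names(wide)
```

With 19 distinct speeds and six statistics each, long should have 114 rows, matching the output in the question, while wide should have 19 rows.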

answered Oct 15 '22 08:10 by akrun