
How can I summarize data statistics using R?

Tags:

r

How can I write a short script that creates a new data frame reporting the following descriptive statistics for each continuous column of the survey data below: mean, standard deviation, median, minimum, maximum, and sample size?

   Distance Age Height Coning
1      21.4  18    3.3    Yes
2      13.9  17    3.4    Yes
3      23.9  16    2.9    Yes
4       8.7  18    3.6     No
5     241.8   6    0.7     No
6      44.5  17    1.3    Yes
7      30.0  15    2.5    Yes
8      32.3  16    1.8    Yes
9      31.4  17    5.0     No
10     32.8  13    1.6     No
11     53.3  12    2.0     No
12     54.3   6    0.9     No
13     96.3  11    2.6     No
14    133.6   4    0.6     No
15     32.1  15    2.3     No
16     57.9  12    2.4    Yes
17     30.8  17    1.8     No
18     59.9   7    0.8     No
19     42.7  15    2.0    Yes
20     20.6  18    1.7    Yes
21     62.0   8    1.3     No
22     53.1   7    1.6     No
23     28.9  16    2.2    Yes
24    177.4   5    1.1     No
25     24.8  14    1.5    Yes
26     75.3  14    2.3    Yes
27     51.6   7    1.4     No
28     36.1   9    1.1     No
29    116.1   6    1.1     No
30     28.1  16    2.5    Yes
31      8.7  19    2.2    Yes
32    105.1   6    0.8     No
33     46.0  15    3.0    Yes
34    102.6   7    1.2     No
35     15.8  15    2.2     No
36     60.0   7    1.3     No
37     96.4  13    2.6     No
38     24.2  14    1.7     No
39     14.5  15    2.4     No
40     36.6  14    1.5     No
41     65.7   5    0.6     No
42    116.3   7    1.6     No
43    113.6   8    1.0     No
44     16.7  15    4.3    Yes
45     66.0   7    1.0     No
46     60.7   7    1.0     No
47     90.6   7    0.7     No
48     91.3   7    1.3     No
49     14.4  18    3.1    Yes
50     72.8  14    3.0    Yes
user3136251 asked Dec 27 '13 13:12



2 Answers

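You can write your own function to compute such a summary for every numeric column. Note that sapply returns a matrix here; wrap the result in as.data.frame() if you need a data.frame: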

# Define the summary function; it returns a named numeric vector
my.summary <- function(x, na.rm=TRUE){
  c(Mean=mean(x, na.rm=na.rm),
    SD=sd(x, na.rm=na.rm),
    Median=median(x, na.rm=na.rm),
    Min=min(x, na.rm=na.rm),
    Max=max(x, na.rm=na.rm),
    N=length(x))  # length(x) counts NA values too
}

# Identify the numeric columns
ind <- sapply(df, is.numeric)

# Apply the function to the numeric columns only
sapply(df[, ind], my.summary)
        Distance       Age     Height
Mean    58.67200 11.840000  1.9160000
SD      45.48137  4.604168  0.9796626
Median  48.80000 13.500000  1.7000000
Min      8.70000  4.000000  0.6000000
Max    241.80000 19.000000  5.0000000
N       50.00000 50.000000 50.0000000
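
Since the question asks for a new data frame, here is a minimal sketch of one way to get there: transpose the sapply result and convert it, which gives one row per variable and one column per statistic. The sample df below is an assumption for illustration (only the first five rows of the survey), and the name stats is hypothetical.

```r
# Assumed sample data: the first five rows of the survey from the question
df <- data.frame(
  Distance = c(21.4, 13.9, 23.9, 8.7, 241.8),
  Age      = c(18, 17, 16, 18, 6),
  Height   = c(3.3, 3.4, 2.9, 3.6, 0.7),
  Coning   = c("Yes", "Yes", "Yes", "No", "No")
)

# Same statistics as in the answer above
my.summary <- function(x, na.rm = TRUE) {
  c(Mean   = mean(x, na.rm = na.rm),
    SD     = sd(x, na.rm = na.rm),
    Median = median(x, na.rm = na.rm),
    Min    = min(x, na.rm = na.rm),
    Max    = max(x, na.rm = na.rm),
    N      = length(x))
}

# Keep the numeric columns; df[ind] stays a data frame even if only one matches
ind <- sapply(df, is.numeric)

# sapply gives a statistics-by-variable matrix; t() flips it so that each
# variable becomes a row, and as.data.frame() turns it into a data frame
stats <- as.data.frame(t(sapply(df[ind], my.summary)))
stats
```

With the full 50-row survey in df, the same three lines at the end should produce a 3 x 6 data frame with rows Distance, Age and Height.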

Or you can use the basicStats function from the fBasics package (an add-on package, not part of base R) for a more detailed summary:

> library(fBasics)
> basicStats(df[, ind])
               Distance        Age    Height
nobs          50.000000  50.000000 50.000000
NAs            0.000000   0.000000  0.000000
Minimum        8.700000   4.000000  0.600000
Maximum      241.800000  19.000000  5.000000
1. Quartile   28.300000   7.000000  1.125000
3. Quartile   74.675000  15.750000  2.475000
Mean          58.672000  11.840000  1.916000
Median        48.800000  13.500000  1.700000
Sum         2933.600000 592.000000 95.800000
SE Mean        6.432037   0.651128  0.138545
LCL Mean      45.746337  10.531510  1.637583
UCL Mean      71.597663  13.148490  2.194417
Variance    2068.555118  21.198367  0.959739
Stdev         45.481371   4.604168  0.979663
Skewness       1.711028  -0.158853  0.905415
Kurtosis       3.753948  -1.574527  0.578684
Jilber Urbina answered Sep 27 '22 19:09


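The following use of do.call, rbind and lapply provides a summary for each numeric column (is.numeric is TRUE for both integer and double columns). lapply is used rather than sapply because it always returns a list, which is what do.call expects; sapply would simplify the result to a matrix if every column were numeric. You can write your own statistics function if you need different statistics than those of summary (see the answer of @Jilber).

mtcars$carb <- as.factor(mtcars$carb)  # Force one column to a factor
# Non-numeric columns yield NULL, which rbind silently drops
do.call('rbind', lapply(mtcars, function(x) if(is.numeric(x)) summary(x)))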
       Min. 1st Qu.  Median     Mean 3rd Qu.    Max.
mpg  10.400  15.420  19.200  20.0900   22.80  33.900
cyl   4.000   4.000   6.000   6.1880    8.00   8.000
disp 71.100 120.800 196.300 230.7000  326.00 472.000
hp   52.000  96.500 123.000 146.7000  180.00 335.000
drat  2.760   3.080   3.695   3.5970    3.92   4.930
wt    1.513   2.581   3.325   3.2170    3.61   5.424
qsec 14.500  16.890  17.710  17.8500   18.90  22.900
vs    0.000   0.000   0.000   0.4375    1.00   1.000
am    0.000   0.000   0.000   0.4062    1.00   1.000
gear  3.000   3.000   4.000   3.6880    4.00   5.000
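
As a rough sketch of the "write your own statistics function" idea with this approach, the example below swaps summary for a custom function on the same mtcars data. The names col.stats, cars and res are hypothetical, and Filter() is used here as one possible way to drop the non-numeric columns up front.

```r
# Custom statistics for one column; col.stats is a hypothetical name
col.stats <- function(x) {
  c(Mean = mean(x), SD = sd(x), Median = median(x),
    Min = min(x), Max = max(x), N = length(x))
}

cars <- mtcars                      # work on a copy of the built-in data set
cars$carb <- as.factor(cars$carb)   # force one column to a factor, as above

# Filter() keeps only the numeric columns; lapply() returns a named list of
# statistic vectors, and do.call(rbind, ...) stacks them into one matrix
res <- do.call(rbind, lapply(Filter(is.numeric, cars), col.stats))
res
```

The result is a matrix with one row per numeric column; as.data.frame(res) converts it if a data frame is needed.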
Paul Hiemstra answered Sep 27 '22 18:09