Mean and Median Vs Summary

I'm currently doing a Reproducible Data course on Coursera, and one of the questions asks for the mean and median of steps per day. I computed both, but when I check them against the summary function, the mean and median it reports are different. I'm running this via knitr.

Why would this be? Edit: below is my whole script so far, including a link to the raw data:

## Download the data

You have to change https to http to get this to work in knitr.

target_url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
target_localfile = "ActivityMonitoringData.zip"
if (!file.exists(target_localfile)) {
  download.file(target_url, destfile = target_localfile) 
}
Unzip the file to the temporary directory

unzip(target_localfile, exdir="extract", overwrite=TRUE)
List the extracted files

list.files("./extract")
## [1] "activity.csv"
Load the extracted data into R

activity.csv <- read.csv("./extract/activity.csv", header = TRUE)
activity1 <- activity.csv[complete.cases(activity.csv),]
str(activity1)
## 'data.frame':    15264 obs. of  3 variables:
##  $ steps   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ date    : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...
Use a histogram to view the number of steps taken each day

histData <- aggregate(steps ~ date, data = activity1, sum)
h <- hist(histData$steps,  # Save histogram as object
          breaks = 11,  # "Suggests" 11 bins
          freq = TRUE,  # Plot counts; TRUE is safer than the reassignable T
          col = "thistle1", 
          main = "Histogram of Activity",
          xlab = "Number of daily steps")


Obtain the Mean and Median of the daily steps

steps <- histData$steps
mean(steps)
## [1] 10766
median(steps)
## [1] 10765
summary(histData$steps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      41    8840   10800   10800   13300   21200
summary(steps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      41    8840   10800   10800   13300   21200
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: i386-w64-mingw32/i386 (32-bit)
## 
## locale:
## [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.6
## 
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.5 formatR_1.0    stringr_0.6.2  tools_3.1.1
Chris asked Oct 14 '14 11:10



1 Answer

Actually, the answers are correct; they are just being printed with too few digits. The digits option is being set to a low value somewhere.
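One way to confirm this is to look at the option directly. This is only a quick check, and the variable name `current_digits` is made up for illustration:

```r
# Read the number of significant digits R currently uses when printing.
# A fresh R session normally reports 7; a smaller value in the knitr
# session would explain the coarser output.
current_digits <- getOption("digits")
print(current_digits)
```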

Put this at the top of your script:

options(digits=12)

And you'll have:

mean(steps)
# [1] 10766.1886792
median(steps)
# [1] 10765
summary(steps)
#      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
#   41.0000  8841.0000 10765.0000 10766.1887 13294.0000 21194.0000 

Notice that summary uses max(3, getOption("digits")-3) as the number of significant digits to print, so it still rounds a little: with digits set to 12 it shows 9 significant digits (10766.1887 instead of 10766.1886792).
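The rounding can be reproduced by hand with `signif`. The sketch below is only an illustration: `summary_digits` is a hypothetical helper, not part of R, and the value 4 for the digits option is a guess that happens to match the output in the question.

```r
# Hypothetical helper: the digits rule quoted above, as a function of the
# digits option.
summary_digits <- function(d = getOption("digits")) max(3, d - 3)

# Full-precision mean and median taken from the output above.
x <- c(mean = 10766.1886792, median = 10765)

# With digits = 4 (an assumed value) only 3 significant digits are kept,
# so both values collapse to the same rounded number.
low <- signif(x, summary_digits(4))
print(low)

# With digits = 12, 9 significant digits are kept.
high <- signif(x, summary_digits(12))
print(high, digits = 12)
```

Under that assumption both values round to 10800, which is what summary printed in the question, while mean and median print the integer part in full.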

m0nhawk answered Oct 07 '22 21:10