I'm currently doing a Reproducible Data course on Coursera, and one of the questions asks for the mean and median of steps per day. I have computed both, but when I check them with the summary function, the mean and median it reports are different. I'm running this via knitr.
Why would this be? Below is an edit showing all of my script so far, including a link to the raw data:
##Download the data
You have to change https to http to get this to work in knitr
target_url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
target_localfile = "ActivityMonitoringData.zip"
if (!file.exists(target_localfile)) {
download.file(target_url, destfile = target_localfile)
}
Unzip the file to the temporary directory
unzip(target_localfile, exdir="extract", overwrite=TRUE)
List the extracted files
list.files("./extract")
## [1] "activity.csv"
Load the extracted data into R
activity.csv <- read.csv("./extract/activity.csv", header = TRUE)
activity1 <- activity.csv[complete.cases(activity.csv),]
str(activity1)
## 'data.frame': 15264 obs. of 3 variables:
## $ steps : int 0 0 0 0 0 0 0 0 0 0 ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
Use a histogram to view the number of steps taken each day
histData <- aggregate(steps ~ date, data = activity1, sum)
h <- hist(histData$steps,          # Save histogram as object
          breaks = 11,             # "Suggests" 11 bins
          freq = TRUE,             # Plot counts rather than densities
          col = "thistle1",
          main = "Histogram of Activity",
          xlab = "Number of daily steps")
Obtain the Mean and Median of the daily steps
steps <- histData$steps
mean(steps)
## [1] 10766
median(steps)
## [1] 10765
summary(histData$steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41 8840 10800 10800 13300 21200
summary(steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41 8840 10800 10800 13300 21200
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: i386-w64-mingw32/i386 (32-bit)
##
## locale:
## [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.6
##
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.5 formatR_1.0 stringr_0.6.2 tools_3.1.1
In these situations, the median is generally considered to be the best representative of the central location of the data. The more skewed the distribution, the greater the difference between the median and mean, and the greater emphasis should be placed on using the median as opposed to the mean.
The median is usually preferred to other measures of central tendency when your data set is skewed (i.e., forms a skewed distribution) or you are dealing with ordinal data. However, the mode can also be appropriate in these situations, but is not as commonly used as the median.
The reason to choose the median is that it carries more information about the distribution than the mode and it is unambiguously acceptable for ordinal data (e.g., using the mean could be controversial, see: Calculate mean of ordinal variable).
What is the difference between mean and median? The mean is the average value of a given data set, and the median is the middle value when the data set is arranged in either ascending or descending order.
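As a small illustration of how skew separates the two measures, here is a sketch using made-up values (the vector `x` is hypothetical and not taken from the activity data):

```r
# Hypothetical data: four small observations and one large outlier
x <- c(1, 2, 3, 4, 100)

mean(x)    # 22: the outlier pulls the average upward
median(x)  # 3: the middle value of the sorted data is unaffected
```

The single large value moves the mean far from most of the observations, while the median stays at the middle of the sorted data.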
Actually, the answers are correct; you are just printing them with too few digits. The digits option is being set somewhere.
Put this before your script:
options(digits=12)
And you'll have:
mean(steps)
# [1] 10766.1886792
median(steps)
# [1] 10765
summary(steps)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 41.0000 8841.0000 10765.0000 10766.1887 13294.0000 21194.0000
Notice that summary uses max(3, getOption("digits")-3) as the number of significant digits to print, so it still rounds the result slightly (10766.1887 instead of 10766.1886792).
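To see how that formula plays out, here is a minimal sketch that applies the same number of significant digits by hand with `signif()`. The helper name `summary_digits` is made up for this illustration, and the `digits` value of 4 is an assumption: it is simply a setting that would be consistent with the 10800 shown in the question.

```r
# Hypothetical helper: significant digits summary() uses for a given `digits` option
summary_digits <- function(digits_option) max(3, digits_option - 3)

x <- 10766.1886792  # the mean of the daily steps from above

summary_digits(4)   # 3 significant digits (assumed setting in the question)
summary_digits(7)   # 4 significant digits (R's default digits = 7)
summary_digits(12)  # 9 significant digits (after options(digits = 12))

# Rounding the mean to that many significant digits by hand
r_small   <- signif(x, summary_digits(4))   # 10800
r_default <- signif(x, summary_digits(7))   # 10770
r_large   <- signif(x, summary_digits(12))  # 10766.1887
```

With only 3 significant digits, both the mean (10766.19) and the median (10765) round to 10800, which matches the summary output in the question.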