These are some newbie questions about statistical programming in R for which I haven't been able to find answers online. My data frame is called "eitc" in the code below.
1) Once I've loaded in a data frame, I would like to look at summary statistics. I've used the functions:
eitc <- read.dta(file="/Users/Documents/eitc.dta")
summary(eitc)
sapply(eitc,mean,na.rm=TRUE) #for sample mean, min, max, etc.
How do I find summary statistics on my data frame when certain conditions are met? For example, I would like to see the summary statistics on all variables when the variable "children" is greater than or equal to 1. The equivalent Stata code is:
summarize if children >= 1
2) Similarly, how do I find specific statistics when certain conditions are met? For example, I want to find the mean of the variable "work" when "post93" is equal to 0 and "anykids" is equal to 1. The equivalent Stata code is:
mean work if post93==0 & anykids==1
3) Ideally, when I run the summary statistics above, I would like to find out how many observations were included in the calculation / fit the criteria.
4) When I read in my data frame, it would also be nice to see how many observations are included in the data set (and perhaps how many rows have missing values or "NA" in them).
5) Also, I have been creating dummy variables using the following code. Is this the correct way to do it or is there a more efficient route?
post93.dummy <- as.numeric(eitc$year>1993)
eitc=cbind(eitc,post93.dummy)
The boxplot function gives a quick visual summary of the variables in a data set: given a data frame of numeric columns, R draws one boxplot per column.
A convenient way to create summary tables in R is to use the describe() and describeBy() functions from the psych package.
A lot of your requirements are answered by subset, e.g.
summary(subset(eitc, post93 == 0 & anykids == 1, select=work))
nrow(subset(eitc, post93 == 0 & anykids == 1, select=work)) # for number of obs.
The ?subset documentation has good examples.
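As a minimal sketch of questions 2 and 3 together, the snippet below reports a conditional mean along with the number of rows that met the condition. The data frame eitc_toy is a made-up stand-in (its values are invented for illustration); only the column names work, post93 and anykids come from the question.

```r
# Toy stand-in for eitc; these values are made up for illustration only
eitc_toy <- data.frame(
  work    = c(1, 0, 1, 1, 0, NA),
  post93  = c(0, 0, 0, 1, 1, 0),
  anykids = c(1, 1, 0, 1, 1, 1)
)

# Keep only the rows with post93 == 0 and anykids == 1
sel <- subset(eitc_toy, post93 == 0 & anykids == 1)

n_rows    <- nrow(sel)                     # rows meeting the condition
n_used    <- sum(!is.na(sel$work))         # rows that actually enter the mean
mean_work <- mean(sel$work, na.rm = TRUE)  # roughly Stata's: mean work if post93==0 & anykids==1
```

Keeping n_rows and n_used separate matters when the variable has missing values, since na.rm = TRUE silently drops them from the calculation.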
The cbind method of attaching dummy variables is unnecessary. Just do:
eitc$post93.dummy <- as.numeric(eitc$year>1993)
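To see the direct assignment at work, here is a small sketch on a toy data frame. The name eitc_toy and its years are assumptions made up for illustration, not values from the real eitc data.

```r
# Toy data; the years are made up for illustration
eitc_toy <- data.frame(year = c(1991, 1992, 1993, 1994, 1995))

# Assign the new column directly; no cbind needed
eitc_toy$post93 <- as.numeric(eitc_toy$year > 1993)

# Cross-tabulate year against the dummy to confirm it switches on after 1993
check <- table(eitc_toy$year, eitc_toy$post93)
```

The comparison year > 1993 yields a logical vector, and as.numeric turns TRUE/FALSE into 1/0, so the new column is a 0/1 indicator with one entry per row.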
I'll use the mtcars data set, available in the datasets package. See ?mtcars.
Ad 1. You can see the summary of mtcars when gear is greater than 3:
summary(mtcars[mtcars$gear > 3, ])
## or by using Tukey's five number summary
sapply(mtcars[mtcars$gear > 3, ], fivenum)
Ad 2. Use with:
with(mtcars, mean(hp[gear > 3 & mpg > 20]))
Ad 3. Ibid (but use length):
with(mtcars, length(hp[gear > 3 & mpg > 20]))
## or
sapply(mtcars[mtcars$gear > 3, ], length) ## counts all values per column, NA's included
sapply(mtcars[mtcars$gear > 3, ], function(x) sum(!is.na(x))) ## counts only non-missing values (length() has no na.rm argument)
nrow(mtcars[mtcars$gear > 3, ])
Ad 4. See previous, but to find out how many rows have missing values or "NA" in them, do something like this:
sum(apply(dtf, 1, function(x) any(is.na(x)))) ## dtf is your data frame; equivalent to sum(!complete.cases(dtf))
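A short sketch of the counts asked for in question 4, using only base R functions. The data frame dtf here is a made-up example with a few missing values, not real data.

```r
# Toy data frame with some missing values, made up for illustration
dtf <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z"),
  c = c(NA, NA, 1, 2)
)

n_obs        <- nrow(dtf)                  # total number of observations
n_incomplete <- sum(!complete.cases(dtf))  # rows with at least one NA
na_per_col   <- colSums(is.na(dtf))        # number of NAs in each column
na_per_row   <- rowSums(is.na(dtf))        # number of NAs in each row
```

complete.cases returns one logical per row, so negating and summing it gives the number of rows that contain any missing value.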
Ad 5. This is not a dummy variable, this is some kind of subset of original data, columnwise concatenated. What are you trying to achieve anyway?
Please be concise. One question per question, please!
I would recommend you look at the plyr package for generating summaries. Here's some quick code (not run):
library(plyr)
# Generate a new factor with 5 levels from the numeric variable children
eitc$childfac <- cut(eitc$children, 5)
# Compute the mean and sd of the (placeholder) variables foo and bar within each level
ddply(eitc, .(childfac), function(df) {
  data.frame(meanfoo = mean(df$foo), sdfoo = sd(df$foo),
             meanbar = mean(df$bar), sdbar = sd(df$bar))
})
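If installing plyr is not an option, a similar grouped summary can be sketched in base R with aggregate. This is a different technique from the ddply call in the answer, and it is shown on the built-in mtcars data, with gear standing in for the grouping factor and hp and mpg for foo and bar.

```r
# Mean and sd of hp and mpg within each level of gear (mtcars ships with R)
by_gear <- aggregate(
  cbind(hp, mpg) ~ gear,
  data = mtcars,
  FUN  = function(x) c(mean = mean(x), sd = sd(x))
)

# Number of observations in each group
group_n <- table(mtcars$gear)
```

Because FUN returns two values, by_gear$hp and by_gear$mpg are each two-column matrices (mean and sd) rather than plain vectors.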
You might also want to look at the Hmisc and psych packages for more descriptive statistics routines. (Check out Quick-R for more info.)