R Language: How do I print / see summary statistics for sample subset?

These are some newbie questions about statistical programming in R for which I haven't been able to find answers online. My data frame is called "eitc" in the code below.

1) Once I've loaded in a data frame, I would like to look at summary statistics. I've used the functions:

library(foreign) # provides read.dta for Stata files
eitc <- read.dta(file="/Users/Documents/eitc.dta")
summary(eitc)
sapply(eitc,mean,na.rm=TRUE) # sample mean of each column; swap in min, max, etc.

How do I find summary statistics on my data frame when certain conditions are met? For example, I would like to see the summary statistics on all variables when the variable "children" is greater than or equal to 1. The equivalent Stata code is:

summarize if children >= 1

2) Similarly, how do I find specific statistics when certain conditions are met? For example, I want to find the mean of the variable "work" when the "post93" variable is equal to zero and the "anykids" variable is equal to 1. The equivalent Stata code is:

mean work if post93==0 & anykids==1

3) Ideally, when I run the summary statistics above, I would like to find out how many observations were included in the calculation / fit the criteria.

4) When I read in my data frame, it would also be nice to see how many observations are included in the data set (and perhaps how many rows have missing values or "NA" in them).

5) Also, I have been creating dummy variables using the following code. Is this the correct way to do it or is there a more efficient route?

post93.dummy <- as.numeric(eitc$year>1993)
eitc=cbind(eitc,post93.dummy)
baha-kev asked Jan 29 '11 08:01


3 Answers

A lot of your requirements are answered by subset, e.g.

summary(subset(eitc, post93 == 0 & anykids == 1, select=work))
nrow(subset(eitc, post93 == 0 & anykids == 1, select=work)) # for number of obs.

The ?subset documentation has good examples.

The cbind method of attaching dummy variables is unnecessary. Just do:

eitc$post93.dummy <- as.numeric(eitc$year>1993) 
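
As a rough, self-contained sketch of the subset approach, the snippet below rebuilds the two Stata commands from the question on a made-up data frame. The name eitc_toy and all of its values are hypothetical stand-ins for the asker's eitc data, which isn't available here.

```r
# Hypothetical toy data standing in for the asker's eitc data frame
eitc_toy <- data.frame(
  year     = c(1991, 1992, 1993, 1994, 1995, 1996),
  children = c(0, 1, 2, 0, 3, 1),
  work     = c(1, 0, 1, 1, 0, 1)
)

# Dummy variables created by direct assignment (no cbind needed)
eitc_toy$post93  <- as.numeric(eitc_toy$year > 1993)
eitc_toy$anykids <- as.numeric(eitc_toy$children >= 1)

# Stata: summarize if children >= 1
with_kids <- subset(eitc_toy, children >= 1)
print(summary(with_kids))
n_with_kids <- nrow(with_kids)  # number of observations meeting the condition

# Stata: mean work if post93==0 & anykids==1
pre_kids  <- subset(eitc_toy, post93 == 0 & anykids == 1)
mean_work <- mean(pre_kids$work, na.rm = TRUE)
n_pre     <- nrow(pre_kids)     # observations that went into the mean
```

Storing the subset once and reusing it for both the statistic and nrow keeps the condition in a single place, so the reported count always matches the rows that were summarised.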
Michael Dunn answered Sep 28 '22 08:09


I'll use the mtcars data set from the datasets package. See ?mtcars.

Ad 1. You can see the summary of mtcars when gear is greater than 3:

summary(mtcars[mtcars$gear > 3, ])
## or by using Tukey's five number summary
sapply(mtcars[mtcars$gear > 3, ], fivenum)

Ad 2. Use with:

with(mtcars, mean(hp[gear > 3 & mpg > 20]))
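
One caveat worth sketching: mtcars has no missing values, but if the conditioning column contains NA, plain logical indexing returns an NA element for each such row, whereas which() (like subset) drops them. The data frame df below is made up purely for illustration.

```r
# Hypothetical data with a missing value in the conditioning column
df <- data.frame(
  work    = c(1, 0, 1, 1),
  anykids = c(1, NA, 1, 0)
)

# Plain logical indexing keeps a slot (as NA) where the condition is NA
raw <- with(df, work[anykids == 1])

# which() returns only the positions where the condition is TRUE
kept <- with(df, work[which(anykids == 1)])

n_raw     <- length(raw)   # includes the NA slot
n_kept    <- length(kept)  # only rows that truly satisfy the condition
mean_kept <- mean(kept)
```

With real survey data this matters mostly for the observation count, since mean(..., na.rm = TRUE) would hide the extra NA but length() would not.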

Ad 3. Ibid (but use length):

with(mtcars, length(hp[gear > 3 & mpg > 20]))
## or
sapply(mtcars[mtcars$gear > 3, ], length) ## which is trivial when there are no NA's
sapply(mtcars[mtcars$gear > 3, ], function(x) sum(!is.na(x))) ## counts non-missing values when there are NA's (length has no na.rm argument)
nrow(mtcars[mtcars$gear > 3, ])

Ad 4. See previous (nrow gives the number of observations). To find out how many rows have missing values or "NA" in them, do something like this, where dtf is your data frame:

sum(!complete.cases(dtf)) ## number of rows with at least one NA
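
A minimal sketch of counting observations and missing values, using only base R. The data frame dtf and its contents are invented here for illustration.

```r
# Hypothetical data frame with a few missing values
dtf <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, NA, FALSE, TRUE)
)

n_obs        <- nrow(dtf)                  # total number of observations
rows_with_na <- sum(!complete.cases(dtf))  # rows containing at least one NA
na_per_row   <- rowSums(is.na(dtf))        # number of NAs in each row
na_per_col   <- colSums(is.na(dtf))        # number of NAs in each column
```

complete.cases answers "how many rows are affected", while the rowSums and colSums variants show where the missing values sit.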

Ad 5. as.numeric(eitc$year > 1993) does give a 0/1 dummy, but cbind is not needed to attach it. Assign it directly as a new column: eitc$post93.dummy <- as.numeric(eitc$year > 1993).

Please be concise. One question per question, please!

aL3xa answered Sep 28 '22 08:09


I would recommend looking at the plyr package for generating summaries. Here's some quick code (not run):

library(plyr)

# Generate a new factor by cutting the numeric variable children into 5 intervals
eitc$childfac <- cut(eitc$children, 5)

# Mean and sd of the placeholder variables foo and bar within each level of that factor
ddply(eitc, .(childfac), function(df) {
  data.frame(meanfoo = mean(df$foo), sdfoo = sd(df$foo),
             meanbar = mean(df$bar), sdbar = sd(df$bar))
})
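
If plyr isn't installed, a similar grouped summary can be sketched in base R with tapply. This is a swapped-in alternative rather than the plyr approach itself, and the data frame toy with its children and foo columns is hypothetical.

```r
# Hypothetical toy data; foo is a placeholder variable as in the code above
toy <- data.frame(
  children = c(0, 0, 1, 1, 2, 2, 3, 3),
  foo      = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Cut the numeric variable into 2 equal-width intervals to form groups
toy$childfac <- cut(toy$children, 2)

# Group-wise mean, standard deviation and observation count
means <- tapply(toy$foo, toy$childfac, mean)
sds   <- tapply(toy$foo, toy$childfac, sd)
ns    <- tapply(toy$foo, toy$childfac, length)

# Collect the results in one data frame, one row per group
result <- data.frame(
  childfac = names(means),
  meanfoo  = as.vector(means),
  sdfoo    = as.vector(sds),
  n        = as.vector(ns)
)
print(result)
```

Including the per-group n alongside the mean and sd also covers the question of how many observations went into each calculation.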

You might also want to look at the Hmisc and psych packages for more descriptive statistics routines. (Check out Quick-R for more info.)

PaulHurleyuk answered Sep 28 '22 07:09