Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Searching for a straightforward way to do Stata's bysort tasks in R

Tags:

r

stata

I'm very new to R, and have been struggling for a couple of days to do something that Stata makes quite straightforward. A friend has given me a relatively complicated answer to this question, but I was wondering if there was a simple way to do the following.

Suppose I have a two variable dataframe, organized as below:

category    var1 
a            1     
a            2    
a            3     
b            4     
b            6  
b            8   
b           10    
c           11     
c           14      
c           17

I want to generate five additional variables, each of which should be inserted into this same dataframe: var2, var3, var4, var5, and var6

(1) var2 is a dummy variable, which takes the value 1 for the first observation in every category (i.e., each of the three groups defined by category), and 0 otherwise.

(2) var3 is a dummy variable, which takes the value 1 for the last observation in every category, 0 otherwise.

(3) var4 counts how many observations are in each of the group to which any particular observation belongs (i.e., so 3 for category a, 4 for category b, and 3 for category c)

(4) var5 records the difference between each observation in var1 and the observation above it

(5) var6 records the difference between each observation in var1 and the observation above it, but only within the groups defined by category.

I am reasonably well acquainted with Stata, and I find all of the above are not difficult to do, using the bysort prefix command. For example, var1 is easily generated by bysort category: gen var1=1 if _n==1. But I've been tearing my hair out for the last day trying to figure out how to do them using R. I'm sure there are several solutions (my friend's involved the ddplyr package, which seemed a step above my paygrade). Is there nothing as easy as bysort?

The final dataset should look something like this:

category    var1     var2     var3     var4     var5    var6   
a            1        1        0        3       n/a      n/a
a            2        0        0        3        1        1
a            3        0        1        3        1        1
b            4        1        0        4        1       n/a
b            6        0        0        4        2        2
b            8        0        0        4        2        2
b           10        0        1        4        2        2
c           11        1        0        3        1       n/a
c           14        0        0        3        3        3
c           17        0        1        3        3        3

Thanks very much for any suggestions, in advance. Sorry for the rookie question; I'm sure this is answered somewhere else, but I wasn't able to find it despite hours of searching.

like image 330
daanoo Avatar asked Aug 13 '14 02:08

daanoo


2 Answers

dat <- read.table(header = TRUE, 
           text = 
'category    var1 
a            1     
a            2    
a            3     
b            4     
b            6  
b            8   
b           10    
c           11     
c           14      
c           17')


(dat <- within(dat, {
  var6 <- ave(var1, category, FUN = function(x) c(NA, diff(x)))
  var5 <- c(NA, diff(var1))
  var4 <- ave(var1, category, FUN = length)
  var3 <- rev(!duplicated(rev(category))) * 1
  var2 <- (!duplicated(category)) * 1
}))

#    category var1 var2 var3 var4 var5 var6
# 1         a    1    1    0    3   NA   NA
# 2         a    2    0    0    3    1    1
# 3         a    3    0    1    3    1    1
# 4         b    4    1    0    4    1   NA
# 5         b    6    0    0    4    2    2
# 6         b    8    0    0    4    2    2
# 7         b   10    0    1    4    2    2
# 8         c   11    1    0    3    1   NA
# 9         c   14    0    0    3    3    3
# 10        c   17    0    1    3    3    3
like image 182
rawr Avatar answered Nov 06 '22 00:11

rawr


An answer using dplyr

library(dplyr)
dat <- dat %>%
 group_by(category) %>%
 mutate(var2 = ifelse(row_number() == 1, 1, 0))%>%
 mutate(var3 = ifelse(row_number() == n(), 1, 0)) %>%
 mutate(var4 = n()) %>%
 mutate(var6 = lag(var1, 1)) %>%
 ungroup() %>%
 mutate(var5 = lag(var1, 1))
like image 8
Matthew Avatar answered Nov 06 '22 01:11

Matthew