
Calculating summary statistic across subsets of dataset [What is the equivalent of Stata's "bysort" in R?]

Tags: r, stata

I have been programming in Stata for the last few years and switched to R about 4 months ago.

I have data in the following format:

       popname sex year age COUNTRY
329447     AUS   f 1921  23     AUS
329448     AUS   f 1921  24     AUS
329449     AUS   f 1921  25     AUS
329450     AUS   f 1921  26     AUS
329451     AUS   f 1921  27     AUS
329452     AUS   f 1921  28     AUS
...
329532     AUS   f 1922  23     AUS
329533     AUS   f 1922  24     AUS
329534     AUS   f 1922  25     AUS
...        ...   .  ..   ..     ...
297729     BLR   f 1987  59     BLR
297730     BLR   f 1987  60     BLR
297731     BLR   f 1987  61     BLR
... 
291941     BLR   m 1973  71     BLR
291942     BLR   m 1973  72     BLR
291993     BLR   m 1974  23     BLR

I would like to create a new summary variable called max.age (the maximum age within each subgroup defined by {popname, sex, year}) in the existing dataset, as follows:

       popname sex year age COUNTRY   max.age
329447     AUS   f 1921  23     AUS   72  
329448     AUS   f 1921  24     AUS   72
329449     AUS   f 1921  25     AUS   72
329450     AUS   f 1921  26     AUS   72
329451     AUS   f 1921  27     AUS   72
329452     AUS   f 1921  28     AUS   72
...
329532     AUS   f 1922  23     AUS   75
329533     AUS   f 1922  24     AUS   75
329534     AUS   f 1922  25     AUS   75
...        ...   .  ..   ..     ...
297729     BLR   f 1987  59     BLR   87
297730     BLR   f 1987  60     BLR   87
297731     BLR   f 1987  61     BLR   87
... 
291941     BLR   m 1973  71     BLR   78
291942     BLR   m 1973  72     BLR   78
291993     BLR   m 1974  23     BLR   78

To do this in Stata, one would use the egen command with the by command as follows:

by State City Day, sort:
egen cnt=seq(), from(23) to(72) block(1);  

I tried doing this in R, using the doBy package. Here's the code I wrote:

IDB <- orderBy(~popname+sex+year+age, data=IDB)
v<-lapplyBy(~sex+year, data=IDB, function(d) c(NA,max(d$age)))
IDB$Max.age <- unlist(v)

This doesn't work, as lapplyBy returns an aggregated dataset of smaller length than the original dataset (IDB).

Could someone kindly point me in the right direction on how to implement Stata-style "by | egen" code in R?

Thanks

Anupa Fabian asked Dec 16 '22 12:12



1 Answer

One thing you'll find with R is that there isn't just one way to do things. One way is via the ave function.

IDB$max.age <- ave(IDB$age, IDB$popname, IDB$sex, IDB$year, FUN=max)
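
To see why this fits the "by | egen" pattern, here is a minimal sketch on made-up data (the rows below are invented for illustration and are not from the question's dataset). ave() returns a vector as long as its input, with each group's summary repeated on every row of that group, so the result can be assigned straight back as a column. The aggregate-then-merge variant, with the hypothetical names group_max, max.age2 and IDB2, is just one of several other ways to get the same result.

```r
# Toy data using the question's column names; the values are made up.
IDB <- data.frame(
  popname = c("AUS", "AUS", "AUS", "AUS", "BLR", "BLR", "BLR", "BLR"),
  sex     = c("f", "f", "f", "f", "f", "f", "f", "f"),
  year    = c(1921, 1921, 1922, 1922, 1921, 1921, 1922, 1922),
  age     = c(23, 72, 23, 75, 59, 87, 60, 80)
)

# ave() applies FUN within each popname/sex/year combination and returns
# one value per original row, so the data frame keeps its length.
IDB$max.age <- ave(IDB$age, IDB$popname, IDB$sex, IDB$year, FUN = max)

# Alternative sketch: collapse to one row per group, then merge the
# group-level maximum back onto the original rows.
group_max <- aggregate(age ~ popname + sex + year, data = IDB, FUN = max)
names(group_max)[names(group_max) == "age"] <- "max.age2"
IDB2 <- merge(IDB, group_max, by = c("popname", "sex", "year"))

print(IDB)
```

Depending on the R version, ave() may emit warnings from max() when some combinations of the grouping variables have no rows; the values for the groups that do exist are unaffected. Passing a single grouping factor built with interaction(..., drop = TRUE) is one way to avoid those warnings.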
Joshua Ulrich answered Jan 30 '23 20:01
