Is there an equivalent to Stata's egen function? [duplicate]

Tags: r, stata

Stata has a very nice command, egen, which makes it easy to compute statistics over groups of observations. For instance, it is possible to compute the max, the mean and the min for each group and add them as variables in the detailed data set. The Stata command is one line of code:

by group : egen max = max(x)

I've never found the same command in R. summarise in the dplyr package makes it easy to compute statistics for each group, but then I have to run a loop to associate the statistic with each observation:

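library("dplyr")
N <- 1000
# Simulated data: 100 possible groups, one numeric variable
tf <- data.frame(group = sample(1:100, size = N, replace = TRUE), x = rnorm(N))
table(tf$group)
# One row per group holding the group maximum
mtf <- summarise(group_by(tf, group), max = max(x))
# Copy each group's maximum back onto the matching rows of tf
tf$max <- NA
for (i in seq_len(nrow(mtf))) {
  tf$max[tf$group == mtf$group[i]] <- mtf$max[i]
}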

Does anyone have a better solution?

PAC asked Jun 11 '14 11:06


1 Answer

Here are a few approaches:

dplyr

library(dplyr)

tf %>% group_by(group) %>% mutate(max = max(x))

ave

This uses only base R:

transform(tf, max = ave(x, group, FUN = max))
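To see what ave does row by row, here is a minimal sketch on a small toy data frame. The values are made up for illustration and are not the simulated data from the question. It adds the max, the mean and the min of each group as new columns, which is roughly what three egen calls would do in Stata:

```r
# Toy data: hypothetical values chosen for illustration only
tf <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  x     = c(1, 3, 2, 5, 4)
)

# ave() applies FUN within each group and returns a vector as long as x,
# so every row receives the statistic of its own group
tf$max  <- ave(tf$x, tf$group, FUN = max)
tf$mean <- ave(tf$x, tf$group, FUN = mean)
tf$min  <- ave(tf$x, tf$group, FUN = min)

print(tf)
```

Because the result has the same length and order as x, no merge or loop is needed to attach it to the original rows.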

data.table

library(data.table)

dt <- data.table(tf)
dt[, max:=max(x), by=group]
G. Grothendieck answered Sep 18 '22 15:09