I'm very new to R, and have been struggling for a couple of days to do something that Stata makes quite straightforward. A friend has given me a relatively complicated answer to this question, but I was wondering if there was a simple way to do the following.
Suppose I have a two variable dataframe, organized as below:
category var1
a 1
a 2
a 3
b 4
b 6
b 8
b 10
c 11
c 14
c 17
I want to generate five additional variables, each of which should be inserted into this same dataframe: var2
, var3
, var4
, var5
, and var6
(1) var2
is a dummy variable, which takes the value 1 for the first observation in every category (i.e., each of the three groups defined by category
), and 0 otherwise.
(2) var3
is a dummy variable, which takes the value 1 for the last observation in every category, 0 otherwise.
(3) var4
counts how many observations are in each of the group to which any particular observation belongs (i.e., so 3 for category
a, 4 for category
b, and 3 for category
c)
(4) var5
records the difference between each observation in var1
and the observation above it
(5) var6
records the difference between each observation in var1
and the observation above it, but only within the groups defined by category
.
I am reasonably well acquainted with Stata, and I find all of the above are not difficult to do, using the bysort
prefix command. For example, var1
is easily generated by bysort category: gen var1=1 if _n==1
. But I've been tearing my hair out for the last day trying to figure out how to do them using R. I'm sure there are several solutions (my friend's involved the ddplyr
package, which seemed a step above my paygrade). Is there nothing as easy as bysort
?
The final dataset should look something like this:
category var1 var2 var3 var4 var5 var6
a 1 1 0 3 n/a n/a
a 2 0 0 3 1 1
a 3 0 1 3 1 1
b 4 1 0 4 1 n/a
b 6 0 0 4 2 2
b 8 0 0 4 2 2
b 10 0 1 4 2 2
c 11 1 0 3 1 n/a
c 14 0 0 3 3 3
c 17 0 1 3 3 3
Thanks very much for any suggestions, in advance. Sorry for the rookie question; I'm sure this is answered somewhere else, but I wasn't able to find it despite hours of searching.
dat <- read.table(header = TRUE,
text =
'category var1
a 1
a 2
a 3
b 4
b 6
b 8
b 10
c 11
c 14
c 17')
(dat <- within(dat, {
var6 <- ave(var1, category, FUN = function(x) c(NA, diff(x)))
var5 <- c(NA, diff(var1))
var4 <- ave(var1, category, FUN = length)
var3 <- rev(!duplicated(rev(category))) * 1
var2 <- (!duplicated(category)) * 1
}))
# category var1 var2 var3 var4 var5 var6
# 1 a 1 1 0 3 NA NA
# 2 a 2 0 0 3 1 1
# 3 a 3 0 1 3 1 1
# 4 b 4 1 0 4 1 NA
# 5 b 6 0 0 4 2 2
# 6 b 8 0 0 4 2 2
# 7 b 10 0 1 4 2 2
# 8 c 11 1 0 3 1 NA
# 9 c 14 0 0 3 3 3
# 10 c 17 0 1 3 3 3
An answer using dplyr
library(dplyr)
dat <- dat %>%
group_by(category) %>%
mutate(var2 = ifelse(row_number() == 1, 1, 0))%>%
mutate(var3 = ifelse(row_number() == n(), 1, 0)) %>%
mutate(var4 = n()) %>%
mutate(var6 = lag(var1, 1)) %>%
ungroup() %>%
mutate(var5 = lag(var1, 1))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With