
Dummy for first new element in a series

Suppose I have a variable that lasts for several periods, like the number of years that I have owned an Ipod. I had the 1st-generation Ipod from 2001 until 2004, then in 2005 I got Ipod 2, and so on. My data frame would look like:

  2001 Ipod1
  2002 Ipod1
  2003 Ipod1
  2004 Ipod1
  2005 Ipod2
  2006 Ipod2
  2007 Ipod2
  2008 Ipod2
  2009 Ipod3
  2010 Ipod3

What I want is to create a dummy for the period in which a new value of the variable arrives, so I would get:

  Year  Var  Dummy
  2001 Ipod1  1
  2002 Ipod1  0
  2003 Ipod1  0
  2004 Ipod1  0
  2005 Ipod2  1
  2006 Ipod2  0
  2007 Ipod2  0
  2008 Ipod2  0
  2009 Ipod3  1
  2010 Ipod3  0

So far I have been able to do this:

df = structure(list(Year = 2001:2010, Var = structure(c(1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 3L, 3L), .Label = c("Ipod1", "Ipod2", "Ipod3"
), class = "factor")), .Names = c("Year", "Var"), class = "data.frame", row.names = c(NA,
-10L))

df$number.in.group = unlist(lapply(table(df$Var),seq.int)) 
df$dummy = ifelse(df$number.in.group == 1,1,0)
df$dummy[1]=0

Actually, unlike in the table above, I would like the first element of the dummy to be zero.

My question is: is there a better way of doing this?

Thanks

aatrujillob asked Feb 03 '12 07:02


4 Answers

How about this:

df$Dummy <- as.numeric(!duplicated(df$Var))

# Or, if you want the first element to be 0,
df$Dummy <- c(0, as.numeric(!duplicated(df$Var))[-1])
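
One caveat, offered as a hedged aside: duplicated() flags only the first time a value appears anywhere in the column, so it would miss a value that comes back after a gap. If that can happen in your data, comparing each element with the one before it is a possible alternative. The vector v below is made up for illustration (Ipod1 returns after Ipod2); it is not the question's data.

```r
# Hypothetical example: Ipod1 returns after a gap (not part of the question's data)
v <- c("Ipod1", "Ipod1", "Ipod2", "Ipod2", "Ipod1", "Ipod1")

# duplicated() only marks the first overall occurrence of each value
dup_dummy <- c(0, as.numeric(!duplicated(v))[-1])

# Comparing each element with its predecessor marks every change of value
chg_dummy <- c(0, as.numeric(v[-1] != v[-length(v)]))

dup_dummy  # 0 0 1 0 0 0
chg_dummy  # 0 0 1 0 1 0
```

For contiguous groups, as in the question, both versions give the same result.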
Josh O'Brien answered Nov 07 '22 18:11


I believe this gives the desired result:

> df$Dummy <- c(0, diff(as.numeric(df$Var)))
> df
   Year   Var Dummy
1  2001 Ipod1     0
2  2002 Ipod1     0
3  2003 Ipod1     0
4  2004 Ipod1     0
5  2005 Ipod2     1
6  2006 Ipod2     0
7  2007 Ipod2     0
8  2008 Ipod2     0
9  2009 Ipod3     1
10 2010 Ipod3     0

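This works because Var is a factor, so as.numeric returns its integer level codes and diff is nonzero wherever the level changes. Note that the result is exactly 1 only when consecutive groups have consecutive level codes, as they do in the example.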
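
A hedged sketch of a variant that does not depend on the level codes increasing by one: test whether the difference is nonzero rather than using the difference itself. The reordering of the levels below is a hypothetical scenario chosen to show the difference; the data frame is rebuilt so the snippet runs on its own.

```r
# Rebuild the question's data so the sketch is self-contained
df <- data.frame(Year = 2001:2010,
                 Var = factor(rep(c("Ipod1", "Ipod2", "Ipod3"), c(4, 4, 2))))

# Hypothetical reordering of the levels: Ipod3 gets code 1, Ipod1 gets code 3
df$Var <- factor(df$Var, levels = c("Ipod3", "Ipod2", "Ipod1"))

# Plain diff() now yields -1 at each change instead of 1
raw_diff <- c(0, diff(as.numeric(df$Var)))

# Testing for a nonzero difference always yields 0 or 1
dummy <- c(0, as.numeric(diff(as.numeric(df$Var)) != 0))

raw_diff  # 0 0 0 0 -1 0 0 0 -1 0
dummy     # 0 0 0 0  1 0 0 0  1 0
```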

Dason answered Nov 07 '22 17:11


The rle function is very useful in these kinds of situations. It finds consecutive runs of the same item in a vector.

rle_result = rle(as.character(df$Var))
rle_result
Run Length Encoding
  lengths: int [1:3] 4 4 2
  values : chr [1:3] "Ipod1" "Ipod2" "Ipod3"

To construct your new variable:

df$new = 0
change_ids = 1 + cumsum(rle_result$lengths)
df$new[change_ids[-length(change_ids)]] <- 1
df
   Year   Var new
1  2001 Ipod1   0
2  2002 Ipod1   0
3  2003 Ipod1   0
4  2004 Ipod1   0
5  2005 Ipod2   1
6  2006 Ipod2   0
7  2007 Ipod2   0
8  2008 Ipod2   0
9  2009 Ipod3   1
10 2010 Ipod3   0

which I think is exactly what you were looking for.
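
As a possible shortcut, and only a sketch: base R's sequence() can expand the run lengths into a within-run counter, which gives both the question's number.in.group column and the dummy in one step. The data frame is rebuilt here so the snippet runs on its own; the variable name runs is my own.

```r
# Rebuild the question's data so the sketch is self-contained
df <- data.frame(Year = 2001:2010,
                 Var = factor(rep(c("Ipod1", "Ipod2", "Ipod3"), c(4, 4, 2))))

runs <- rle(as.character(df$Var))

# sequence(c(4, 4, 2)) counts 1..4, 1..4, 1..2: the position within each run
number.in.group <- sequence(runs$lengths)

# A run starts wherever the counter is 1; the first row is then forced to 0
df$new <- as.numeric(number.in.group == 1)
df$new[1] <- 0

df$new  # 0 0 0 0 1 0 0 0 1 0
```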

Paul Hiemstra answered Nov 07 '22 17:11


(1) The question asked for a Dummy column, but the sample code in the question also produced a number.in.group column, so I was not sure whether number.in.group was required; below we assume it is. Note that assigning 0 to the first element of Dummy has the effect of converting that column from logical to numeric:

within(df, {
    number.in.group <- ave(Year, Var, FUN = seq_along)
    Dummy <- number.in.group == 1
    Dummy[1] <- 0
})
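
The coercion mentioned above can be seen in isolation with a small made-up vector (x and y are illustrative names, not part of the question). Assigning the number 0 into a logical vector promotes the whole vector to numeric, whereas assigning FALSE would keep it logical:

```r
# A logical vector, like the result of number.in.group == 1
x <- c(TRUE, FALSE, FALSE)
class(x)   # "logical"

# Assigning a numeric 0 coerces every element to numeric
x[1] <- 0
class(x)   # "numeric"
x          # 0 0 0

# Assigning FALSE instead leaves the vector logical
y <- c(TRUE, FALSE, FALSE)
y[1] <- FALSE
class(y)   # "logical"
```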

(2a) If number.in.group is not needed and the groups in Var are contiguous, as in the example, then the duplicated solution already presented would be preferable, except that I think it would be slightly clearer written like this:

df$Dummy <- !duplicated(df$Var)
df$Dummy[1] <- 0

even though that requires one additional statement.

(2b) Also we might prefer a non-destructive form:

within(df, {
    Dummy <- !duplicated(Var)
    Dummy[1] <- 0
})
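
To make "non-destructive" concrete, here is a minimal sketch: within() returns a modified copy, so df itself is unchanged unless the result is assigned back. The name df2 is my own, and the data frame is rebuilt so the snippet runs on its own.

```r
# Rebuild the question's data so the sketch is self-contained
df <- data.frame(Year = 2001:2010,
                 Var = factor(rep(c("Ipod1", "Ipod2", "Ipod3"), c(4, 4, 2))))

# within() evaluates the block against a copy of df and returns that copy
df2 <- within(df, {
    Dummy <- !duplicated(Var)
    Dummy[1] <- 0
})

names(df)   # "Year" "Var"           (original is untouched)
names(df2)  # "Year" "Var" "Dummy"
df2$Dummy   # 0 0 0 0 1 0 0 0 1 0
```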
G. Grothendieck answered Nov 07 '22 18:11