Suppose we have the following data. The rows represent a country and the columns (in05:in09
) indicate whether that country was present in a database of interest in the given year (2005:2009
).
id <- c("a", "b", "c", "d")
in05 <- c(1, 0, 0, 1)
in06 <- c(0, 0, 0, 1)
in07 <- c(1, 1, 0, 1)
in08 <- c(0, 1, 1, 1)
in09 <- c(0, 0, 0, 1)
df <- data.frame(id, in05, in06, in07, in08, in09)
I want to create a variable firstyear
which indicates the first year in which the country was present in the database. Right now I do the following:
df$firstyear <- ifelse(df$in05==1,2005,
ifelse(df$in06==1,2006,
ifelse(df$in07==1, 2007,
ifelse(df$in08==1, 2008,
ifelse(df$in09==1, 2009,
0)))))
The above code is already not very nice, and my dataset contains many more years. Is there an alternative, using *apply
functions, loops or something else, to create this firstyear
variable?
Place one If Statement inside another If Statement called as Nested If Else in R Programming. The If Else statement allows us to print different statements depending upon the expression result (TRUE, or FALSE). Sometimes we have to check further when the condition is TRUE.
if statement: specifies a block of statements if the condition is true. elif statement: specifies a new condition to test, if the first condition is false. else statement: specifies a block of statements if the condition is false.
The ifelse function is used to assign one object or another depending on whether the first argument, test, is TRUE or FALSE. It even works as one would hope when test is a vector. test – A logical expression, which may be a vector.
Use the IF function, one of the logical functions, to return one value if a condition is true and another value if it's false. For example: =IF(A2>B2,"Over Budget","OK")
You can vectorize using max.col
indx <- names(df)[max.col(df[-1], ties.method = "first") + 1L]
df$firstyear <- as.numeric(sub("in", "20", indx))
df
# id in05 in06 in07 in08 in09 firstyear
# 1 a 1 0 1 0 0 2005
# 2 b 0 0 1 1 0 2007
# 3 c 0 0 0 1 0 2008
# 4 d 1 1 1 1 1 2005
df$FirstYear <- gsub('in', '20', names(df))[apply(df, 1, match, x=1)]
df
id in05 in06 in07 in08 in09 FirstYear
1 a 1 0 1 0 0 2005
2 b 0 0 1 1 0 2007
3 c 0 0 0 1 0 2008
4 d 1 1 1 1 1 2005
There are many ways to do it. I used match
because it will find the first instance of a specified value. The other parts of the code are for presentation. First going line by line with apply
and naming the years by the column names with names
. The assignment <-
and df$FirstYear
is a way to add the result to the data frame.
added credit @David Arenburg has a cool idea about subbing the in
for 20
in the FirstYear
column.
Another answer with some notes of efficiency (although this QA is not about speed).
Firstly, it could be better to avoid the conversion of a "list"-y structure to a "matrix"; sometimes it's worth to convert to a "matrix" and use a function that handles efficiently a 'vector with a "dim" attribute' (i.e. a "matrix"/"array") - other times it's not. Both max.col
and apply
convert to a "matrix".
Secondly, in situations like these, where we do not need to check all the data while getting to a solution, we could benefit from a solution with a loop that controls what goes through to the next iteration. Here we know that we can stop when we've found the first "1". Both max.col
(and which.max
) have to loop once to, actually, find the maximum value; the fact that we know that "max == 1" is not taken advantage of.
Thirdly, match
is potentially slower when we seek only one value in another vector of values because match
's setup is rather complicated and costly:
x = 5; set.seed(199); tab = sample(1e6)
identical(match(x, tab), which.max(x == tab))
#[1] TRUE
microbenchmark::microbenchmark(match(x, tab), which.max(x == tab), times = 25)
#Unit: milliseconds
# expr min lq median uq max neval
# match(x, tab) 142.22327 142.50103 142.79737 143.19547 145.37669 25
# which.max(x == tab) 18.91427 18.93728 18.96225 19.58932 38.34253 25
To sum up, a way to work on the "list" structure of a "data.frame" and to stop computations when we find a "1", could be a loop like the following:
ff = function(x)
{
x = as.list(x)
ans = as.integer(x[[1]])
for(i in 2:length(x)) {
inds = ans == 0L
if(!any(inds)) return(ans)
ans[inds] = i * (x[[i]][inds] == 1)
}
return(ans)
}
And the solutions in the other answers (ignoring the extra steps for the output):
david = function(x) max.col(x, "first")
plafort = function(x) apply(x, 1, match, x = 1)
ff(df[-1])
#[1] 1 3 4 1
david(df[-1])
#[1] 1 3 4 1
plafort(df[-1])
#[1] 1 3 4 1
And some benchmarks:
set.seed(007)
DF = data.frame(id = seq_len(1e6),
"colnames<-"(matrix(sample(0:1, 1e7, T, c(0.25, 0.75)), 1e6),
paste("in", 11:20, sep = "")))
identical(ff(DF[-1]), david(DF[-1]))
#[1] TRUE
identical(ff(DF[-1]), plafort(DF[-1]))
#[1] TRUE
microbenchmark::microbenchmark(ff(DF[-1]), david(DF[-1]), as.matrix(DF[-1]), times = 30)
#Unit: milliseconds
# expr min lq median uq max neval
# ff(DF[-1]) 64.83577 65.45432 67.87486 70.32073 86.72838 30
# david(DF[-1]) 112.74108 115.12361 120.16118 132.04803 145.45819 30
# as.matrix(DF[-1]) 20.87947 22.01819 27.52460 32.60509 45.84561 30
system.time(plafort(DF[-1]))
# user system elapsed
# 4.117 0.000 4.125
Not really an apocalypse, but worth to see that simple, straightforward algorithmic approaches can -indeed- prove to be equally good or even better depending on the problem. Obviously, (most) other times looping in R can be laborious.
You can use dplyr::case_when
inside dplyr::mutate()
along the lines of the method presented in this tweet.
# Using version 0.5.0.
# Dev version may work without `with()`.
df %>%
mutate(., firstyear = with(., case_when(
in05 == 1 ~ 2005,
in06 == 1 ~ 2006,
in07 == 1 ~ 2007,
in08 == 1 ~ 2008,
in09 == 1 ~ 2009,
TRUE ~ 0
)))
Here is another option:
years <- as.integer(substr(names(df[-1]), 3, 4)) + 2000L
cbind(df, yr=do.call(pmin.int, Map(`/`, years, df[-1])))
Produces:
id in05 in06 in07 in08 in09 yr
1 a 1 0 1 0 0 2005
2 b 0 0 1 1 0 2007
3 c 0 0 0 1 0 2008
4 d 1 1 1 1 1 2005
And is fast. Here timing only the finding the min year step using Alexis' data:
Unit: milliseconds
expr min lq median uq max neval
do.call(pmin.int, Map(`/`, 11:20, DF[-1])) 178.46993 194.3760 219.8898 229.1597 307.1120 10
ff(DF[-1]) 416.07297 434.0792 439.1970 452.8345 496.2048 10
max.col(DF[-1], "first") 99.71936 138.2285 175.2334 207.6365 239.6519 10
Oddly this doesn't reproduce Alexis' timings, showing David's as the fastest. This is on R 3.1.2.
EDIT: based on convo with Frank, I updated Alexis function to be more compatible with R 3.1.2:
ff2 = function(x) {
ans = as.integer(x[[1]])
for(i in 2:length(x)) {
inds = which(ans == 0L)
if(!length(inds)) return(ans)
ans[inds] = i * (x[[i]][inds] == 1)
}
return(ans)
}
And this comes closer to the original results:
Unit: milliseconds
expr min lq median uq max neval
ff(DF[-1]) 407.92699 415.11716 421.18274 428.02092 462.2474 10
ff2(DF[-1]) 64.20484 72.74729 79.85748 81.29153 148.6439 10
I always prefer to work with tidied data. First method filters on cumsums
# Tidy
df <- df %>%
gather(year, present.or.not, -id)
# Create df of first instances
first.df <- df %>%
group_by(id, present.or.not) %>%
mutate(ranky = rank(cumsum(present.or.not)),
first.year = year) %>%
filter(ranky == 1)
# Prepare for join
first.df <- first.df[,c('id', 'first.year')]
# Join with original
df <- left_join(df,first.df)
# Spread
spread(df, year, present.or.not)
Or this alternative that, after tidying, slices the first row from arranged groups.
df %>%
gather(year, present_or_not, -id) %>%
filter(present_or_not==1) %>%
group_by(id) %>%
arrange(id, year) %>%
slice(1) %>%
mutate(year = str_replace(year, "in", "20")) %>%
select(1:2) %>%
right_join(df)`
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With