Alternatives to nested ifelse statements in R

Tags:

Suppose we have the following data. The rows represent a country and the columns (in05:in09) indicate whether that country was present in a database of interest in the given year (2005:2009).

id <- c("a", "b", "c", "d")
in05 <- c(1, 0, 0, 1)
in06 <- c(0, 0, 0, 1)
in07 <- c(1, 1, 0, 1)
in08 <- c(0, 1, 1, 1)
in09 <- c(0, 0, 0, 1)
df <- data.frame(id, in05, in06, in07, in08, in09)

I want to create a variable firstyear which indicates the first year in which the country was present in the database. Right now I do the following:

df$firstyear <- ifelse(df$in05==1,2005,
    ifelse(df$in06==1,2006,
        ifelse(df$in07==1, 2007,
            ifelse(df$in08==1, 2008,
                ifelse(df$in09==1, 2009,
                    0)))))

The above code is already not very nice, and my dataset contains many more years. Is there an alternative, using *apply functions, loops or something else, to create this firstyear variable?

780

asked May 27 '15 22:05

6 Answers

You can vectorize using max.col

indx <- names(df)[max.col(df[-1], ties.method = "first") + 1L]
df$firstyear <- as.numeric(sub("in", "20", indx))
df
#   id in05 in06 in07 in08 in09 firstyear
# 1  a    1    0    1    0    0      2005
# 2  b    0    0    1    1    0      2007
# 3  c    0    0    0    1    0      2008
# 4  d    1    1    1    1    1      2005

163

answered Oct 01 '22 18:10

David Arenburg

df$FirstYear <- gsub('in', '20', names(df))[apply(df, 1, match, x=1)]
df
  id in05 in06 in07 in08 in09 FirstYear
1  a    1    0    1    0    0      2005
2  b    0    0    1    1    0      2007
3  c    0    0    0    1    0      2008
4  d    1    1    1    1    1      2005

There are many ways to do it. I used match because it will find the first instance of a specified value. The other parts of the code are for presentation. First going line by line with apply and naming the years by the column names with names. The assignment <- and df$FirstYear is a way to add the result to the data frame.

added credit @David Arenburg has a cool idea about subbing the in for 20 in the FirstYear column.

answered Oct 01 '22 18:10

Pierre L

Another answer with some notes of efficiency (although this QA is not about speed).

Firstly, it could be better to avoid the conversion of a "list"-y structure to a "matrix"; sometimes it's worth to convert to a "matrix" and use a function that handles efficiently a 'vector with a "dim" attribute' (i.e. a "matrix"/"array") - other times it's not. Both max.col and apply convert to a "matrix".

Secondly, in situations like these, where we do not need to check all the data while getting to a solution, we could benefit from a solution with a loop that controls what goes through to the next iteration. Here we know that we can stop when we've found the first "1". Both max.col (and which.max) have to loop once to, actually, find the maximum value; the fact that we know that "max == 1" is not taken advantage of.

Thirdly, match is potentially slower when we seek only one value in another vector of values because match's setup is rather complicated and costly:

x = 5; set.seed(199); tab = sample(1e6)
identical(match(x, tab), which.max(x == tab))
#[1] TRUE
microbenchmark::microbenchmark(match(x, tab), which.max(x == tab), times = 25)
#Unit: milliseconds
#                expr       min        lq    median        uq       max neval
#       match(x, tab) 142.22327 142.50103 142.79737 143.19547 145.37669    25
# which.max(x == tab)  18.91427  18.93728  18.96225  19.58932  38.34253    25

To sum up, a way to work on the "list" structure of a "data.frame" and to stop computations when we find a "1", could be a loop like the following:

ff = function(x)
{
    x = as.list(x)
    ans = as.integer(x[[1]])
    for(i in 2:length(x)) {
        inds = ans == 0L
        if(!any(inds)) return(ans)
        ans[inds] = i * (x[[i]][inds] == 1)
    }
    return(ans)
}

And the solutions in the other answers (ignoring the extra steps for the output):

david = function(x) max.col(x, "first")
plafort = function(x) apply(x, 1, match, x = 1)

ff(df[-1])
#[1] 1 3 4 1
david(df[-1])
#[1] 1 3 4 1
plafort(df[-1])
#[1] 1 3 4 1

And some benchmarks:

set.seed(007)
DF = data.frame(id = seq_len(1e6),
                "colnames<-"(matrix(sample(0:1, 1e7, T, c(0.25, 0.75)), 1e6), 
                             paste("in", 11:20, sep = "")))
identical(ff(DF[-1]), david(DF[-1]))
#[1] TRUE
identical(ff(DF[-1]), plafort(DF[-1]))
#[1] TRUE
microbenchmark::microbenchmark(ff(DF[-1]), david(DF[-1]), as.matrix(DF[-1]), times = 30)
#Unit: milliseconds
#              expr       min        lq    median        uq       max neval
#        ff(DF[-1])  64.83577  65.45432  67.87486  70.32073  86.72838    30
#     david(DF[-1]) 112.74108 115.12361 120.16118 132.04803 145.45819    30
# as.matrix(DF[-1])  20.87947  22.01819  27.52460  32.60509  45.84561    30

system.time(plafort(DF[-1]))
#   user  system elapsed 
#  4.117   0.000   4.125

Not really an apocalypse, but worth to see that simple, straightforward algorithmic approaches can -indeed- prove to be equally good or even better depending on the problem. Obviously, (most) other times looping in R can be laborious.

answered Oct 01 '22 17:10

alexis_laz

You can use dplyr::case_when inside dplyr::mutate() along the lines of the method presented in this tweet.

# Using version 0.5.0.
# Dev version may work without `with()`.    
df %>%
      mutate(., firstyear = with(., case_when(
        in05 == 1 ~ 2005,
        in06 == 1 ~ 2006,
        in07 == 1 ~ 2007,
        in08 == 1 ~ 2008,
        in09 == 1 ~ 2009,
        TRUE ~ 0
)))

answered Oct 01 '22 17:10

seasmith

Here is another option:

years <- as.integer(substr(names(df[-1]), 3, 4)) + 2000L
cbind(df, yr=do.call(pmin.int, Map(`/`, years, df[-1])))

Produces:

  id in05 in06 in07 in08 in09   yr
1  a    1    0    1    0    0 2005
2  b    0    0    1    1    0 2007
3  c    0    0    0    1    0 2008
4  d    1    1    1    1    1 2005

And is fast. Here timing only the finding the min year step using Alexis' data:

Unit: milliseconds
                                       expr       min       lq   median       uq      max neval
 do.call(pmin.int, Map(`/`, 11:20, DF[-1])) 178.46993 194.3760 219.8898 229.1597 307.1120    10
                                 ff(DF[-1]) 416.07297 434.0792 439.1970 452.8345 496.2048    10
                   max.col(DF[-1], "first")  99.71936 138.2285 175.2334 207.6365 239.6519    10

Oddly this doesn't reproduce Alexis' timings, showing David's as the fastest. This is on R 3.1.2.

EDIT: based on convo with Frank, I updated Alexis function to be more compatible with R 3.1.2:

ff2 = function(x) {
  ans = as.integer(x[[1]])
  for(i in 2:length(x)) {
      inds = which(ans == 0L)
      if(!length(inds)) return(ans)
      ans[inds] = i * (x[[i]][inds] == 1)
  }
  return(ans)
}

And this comes closer to the original results:

Unit: milliseconds
        expr       min        lq    median        uq      max neval
  ff(DF[-1]) 407.92699 415.11716 421.18274 428.02092 462.2474    10
 ff2(DF[-1])  64.20484  72.74729  79.85748  81.29153 148.6439    10

answered Oct 01 '22 17:10

BrodieG

I always prefer to work with tidied data. First method filters on cumsums

# Tidy
df <- df %>% 
  gather(year, present.or.not, -id) 

# Create df of first instances
first.df <- df %>% 
  group_by(id, present.or.not) %>% 
  mutate(ranky = rank(cumsum(present.or.not)), 
         first.year = year) %>% 
  filter(ranky == 1)

# Prepare for join
first.df <- first.df[,c('id', 'first.year')]

# Join with original
df <- left_join(df,first.df)

# Spread
spread(df, year, present.or.not)

Or this alternative that, after tidying, slices the first row from arranged groups.

df %>% 
  gather(year, present_or_not, -id) %>% 
  filter(present_or_not==1) %>% 
  group_by(id) %>% 
  arrange(id, year) %>% 
  slice(1) %>% 
  mutate(year = str_replace(year, "in", "20")) %>% 
  select(1:2) %>% 
  right_join(df)`

answered Oct 01 '22 18:10

Nettle

Related questions
                            
                                How can I paste 100000 without it being shortened to 1e+05? [duplicate]
                            
                                How to skip an error in a loop
                            
                                non-numeric argument to binary operator [closed]
                            
                                Including Bibliography in RMarkdown document with use of the knitcitations
                            
                                R function prcomp fails with NA's values even though NA's are allowed
                            
                                How do I position two legends independently in ggplot
                            
                                How to add a factor column to dataframe based on a conditional statement from another column?
                            
                                Centering a plot within a fluidRow in Shiny
                            
                                Downloading Yahoo stock prices in R
                            
                                Subset based on variable column name
                            
                                Selecting non-consecutive columns in R tables
                            
                                turning off case sensitivity in r
                            
                                How to add header to a dataset in R?
                            
                                R Markdown Math Equation Alignment
                            
                                How to include RMarkdown file in r package? [duplicate]
                            
                                Combine lists while overriding values with same name in R
                            
                                Paste/Collapse in R
                            
                                How to rank within groups in R?
                            
                                Writing to specific schemas with RPostgreSQL
                            
                                What is the difference between NaN and Inf, and NULL and NA in R?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Alternatives to nested ifelse statements in R

Tags:

loops

r

nested-loops

if-statement

Kat

People also ask

6 Answers

David Arenburg

Pierre L

alexis_laz

seasmith

BrodieG

Nettle

Recent Activity

Donate For Us