I have the following data.frame df:
df = data.frame(col1    = c('a','a','a','a','a','b','b','c','d'),
                col2    = c('a','a','a','b','b','b','b','a','a'),
                height1 = c(NA,32,NA,NA,NA,NA,NA,25,NA),
                height2 = c(31,31.5,NA,NA,11,12,13,NA,NA),
                col3    = 1:9)
#  col1 col2 height1 height2 col3
#1    a    a      NA    31.0    1
#2    a    a      32    31.5    2
#3    a    a      NA      NA    3
#4    a    b      NA      NA    4
#5    a    b      NA    11.0    5
#6    b    b      NA    12.0    6
#7    b    b      NA    13.0    7
#8    c    a      25      NA    8
#9    d    a      NA      NA    9
For each pair of values (col1, col2), I want to build a column height such that:
- NA in both height1 and height2: return NA.
- A non-NA value in height1: take this value. (For a given pair col1, col2, there is at most one non-NA value in column height1.)
- NA in height1 and some non-NA values in height2: take the first value in height2.
I also need to keep the corresponding value from column col3.
The new data.frame new.df will look like:
#  col1 col2 height col3
#1    a    a     32    2
#2    a    b     11    5
#3    b    b     12    6
#4    c    a     25    8
#5    d    a     NA    9
I would prefer a data.frame approach, quite concise, but I realize I am unable to find one!
Maybe not the elegant solution you are looking for but here is a base R option:
do.call("rbind",
        lapply(split(df,paste0(df$col1,df$col2)),
               function(tab) {
                 colnames(tab)[3:4] <- "height" 
                 out <- if(any(!is.na(tab[, 3]))) {
                           tab[which(!is.na(tab[,3])),-4]
                        } else {
                           if (any(!is.na(tab[,4]))) {
                              tab[which(!is.na(tab[,4]))[1],c(1:2,4:5)]
                           } else {
                              tab[1,-4]
                           }
                        }
                return(out)
               }
        )
      )
#       col1 col2 height col3
#    aa    a    a     32    2
#    ab    a    b     11    5
#    bb    b    b     12    6
#    ca    c    a     25    8
#    da    d    a     NA    9
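A more compact base R variant is possible by ranking the rows and keeping the first row of each (col1, col2) pair. This is only a sketch under the question's stated assumption (at most one non-NA height1 per pair); the names prio, sorted and new.df are illustrative and not from the answer above.

```r
df <- data.frame(col1    = c('a','a','a','a','a','b','b','c','d'),
                 col2    = c('a','a','a','b','b','b','b','a','a'),
                 height1 = c(NA,32,NA,NA,NA,NA,NA,25,NA),
                 height2 = c(31,31.5,NA,NA,11,12,13,NA,NA),
                 col3    = 1:9)

# Row priority: 1 = height1 present, 2 = only height2 present, 3 = both NA
prio <- ifelse(!is.na(df$height1), 1, ifelse(!is.na(df$height2), 2, 3))

# Combined height: height1 if present, otherwise height2 (which may be NA)
df$height <- ifelse(!is.na(df$height1), df$height1, df$height2)

# Sort by pair, then priority; the row index breaks ties so that the
# first height2 in original row order wins
ord    <- order(df$col1, df$col2, prio, seq_len(nrow(df)))
sorted <- df[ord, c("col1", "col2", "height", "col3")]

# Keep the first (best-ranked) row of each (col1, col2) pair
new.df <- sorted[!duplicated(sorted[c("col1", "col2")]), ]
rownames(new.df) <- NULL
new.df
```

Unlike the split() approach, this does not paste the keys together, so pairs such as ("ab", "c") and ("a", "bc") cannot collide.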
With dplyr:
library(dplyr)

df %>%
  mutate( 
    order = ifelse(!is.na(height1), 1, ifelse(!is.na(height2), 2, 3)),
    height = ifelse(!is.na(height1), height1, ifelse(!is.na(height2), height2, NA))
    ) %>%
  arrange( col1, col2, order, height) %>%
  # .keep_all = TRUE is needed in dplyr >= 0.5, otherwise height and col3 are dropped
  distinct(col1, col2, .keep_all = TRUE) %>%
  select( col1, col2, height, col3)
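Note that sorting on height picks the smallest height2 within a pair, which matches "the first value in height2" on this data only because 12 comes before 13. A variant that breaks ties by original row position instead might look like the sketch below; it assumes a dplyr version that provides coalesce(), case_when() and the .keep_all argument of distinct(), and the helper columns prio and row are illustrative names.

```r
library(dplyr)

df <- data.frame(col1    = c('a','a','a','a','a','b','b','c','d'),
                 col2    = c('a','a','a','b','b','b','b','a','a'),
                 height1 = c(NA,32,NA,NA,NA,NA,NA,25,NA),
                 height2 = c(31,31.5,NA,NA,11,12,13,NA,NA),
                 col3    = 1:9)

new.df <- df %>%
  mutate(
    # 1 = height1 present, 2 = only height2 present, 3 = both NA
    prio   = case_when(!is.na(height1) ~ 1L,
                       !is.na(height2) ~ 2L,
                       TRUE            ~ 3L),
    # height1 if present, otherwise height2 (which may be NA)
    height = coalesce(height1, height2),
    # original row position, used to break ties
    row    = row_number()
  ) %>%
  arrange(col1, col2, prio, row) %>%
  # keep the first row of each pair together with its other columns
  distinct(col1, col2, .keep_all = TRUE) %>%
  select(col1, col2, height, col3)

new.df
```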
I use data.table (although, exceptionally, I would like a data.frame option here) and I find my solution inelegant:
func = function(df)
{
    if(all(is.na(subset(df, select=c(height1,height2)))))
        return(df[1,])
    if(any(!is.na(df$height1)))
        return(df[!is.na(df$height1),])
    df[!is.na(df$height2),][1,]
}
library(data.table)
setDT(df)
new.df=df[,func(.SD),by=list(col1,col2)]
new.df = data.frame(new.df)
new.df$height = ifelse(is.na(new.df$height1), new.df$height2, new.df$height1)
#> new.df
#  col1 col2 height1 height2 col3 height
#1    a    a      32    31.5    2     32
#2    a    b      NA    11.0    5     11
#3    b    b      NA    12.0    6     12
#4    c    a      25      NA    8     25
#5    d    a      NA      NA    9     NA