I have the following data.frame df:
df = data.frame(col1 = c('a','a','a','a','a','b','b','c','d'),
col2 = c('a','a','a','b','b','b','b','a','a'),
height1 = c(NA,32,NA,NA,NA,NA,NA,25,NA),
height2 = c(31,31.5,NA,NA,11,12,13,NA,NA),
col3 = 1:9)
# col1 col2 height1 height2 col3
#1 a a NA 31.0 1
#2 a a 32 31.5 2
#3 a a NA NA 3
#4 a b NA NA 4
#5 a b NA 11.0 5
#6 b b NA 12.0 6
#7 b b NA 13.0 7
#8 c a 25 NA 8
#9 d a NA NA 9
For each pair of values in col1, col2, I want to build a column height containing values such that:
- If there is only NA in height1 and height2, return NA.
- If there is a value in height1, take this value. (For a pair col1, col2, there is at most one non-NA value in column height1.)
- If there is only NA in height1 and some non-NA values in height2, take the first value in height2.
I also need to keep the corresponding values in column col3.
The new data.frame new.df will look like:
# col1 col2 height col3
#1 a a 32 2
#2 a b 11 5
#3 b b 12 6
#4 c a 25 8
#5 d a NA 9
I would prefer a data.frame approach, quite concise, but I realize I am unable to find one!
Maybe not the elegant solution you are looking for, but here is a base R option:
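do.call("rbind",
  lapply(split(df, paste0(df$col1, df$col2)),
    function(tab) {
      # Name both height columns "height" so either one can be returned under that name
      colnames(tab)[3:4] <- "height"
      out <- if (any(!is.na(tab[, 3]))) {
        # height1 has a value: keep that row and drop height2
        tab[which(!is.na(tab[, 3])), -4]
      } else {
        if (any(!is.na(tab[, 4]))) {
          # only height2 has values: keep the first one and drop height1
          tab[which(!is.na(tab[, 4]))[1], c(1:2, 4:5)]
        } else {
          # no height at all: keep the first row, whose height is NA
          tab[1, -4]
        }
      }
      return(out)
    }
  )
)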
# col1 col2 height col3
# aa a a 32 2
# ab a b 11 5
# bb b b 12 6
# ca c a 25 8
# da d a NA 9
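The three rules from the question can also be expressed as a row ranking followed by "keep the first row per pair". The following is only a sketch under that reading of the rules, not a drop-in replacement for the code above: prio and sorted are helper names introduced here for illustration, and the approach relies on order() leaving tied rows in their original order.

```r
df <- data.frame(
  col1 = c('a','a','a','a','a','b','b','c','d'),
  col2 = c('a','a','a','b','b','b','b','a','a'),
  height1 = c(NA,32,NA,NA,NA,NA,NA,25,NA),
  height2 = c(31,31.5,NA,NA,11,12,13,NA,NA),
  col3 = 1:9
)

# Rank each row: 1 = has height1, 2 = has only height2, 3 = has neither.
# (prio is a hypothetical helper name, not part of the original question.)
prio <- ifelse(!is.na(df$height1), 1L, ifelse(!is.na(df$height2), 2L, 3L))

# height1 when available, otherwise height2 (which may itself be NA).
df$height <- ifelse(is.na(df$height1), df$height2, df$height1)

# Sort by pair, then by rank; tied rows keep their original order,
# so the first non-NA height2 of a pair comes before later ones.
sorted <- df[order(df$col1, df$col2, prio), ]

# Keep the first (best-ranked) row of each (col1, col2) pair.
new.df <- sorted[!duplicated(sorted[c("col1", "col2")]),
                 c("col1", "col2", "height", "col3")]
rownames(new.df) <- NULL
new.df
```

On the example data this keeps the rows with col3 equal to 2, 5, 6, 8 and 9, matching the new.df requested in the question.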
With dplyr:
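library(dplyr)

df %>%
  mutate(
    # 1 = height1 available, 2 = only height2 available, 3 = neither
    order = ifelse(!is.na(height1), 1, ifelse(!is.na(height2), 2, 3)),
    height = ifelse(!is.na(height1), height1, ifelse(!is.na(height2), height2, NA))
  ) %>%
  # tied rows keep their original order, so the first non-NA height2 wins
  arrange(col1, col2, order) %>%
  # .keep_all = TRUE keeps height and col3; without it, recent dplyr versions drop them
  distinct(col1, col2, .keep_all = TRUE) %>%
  select(col1, col2, height, col3)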
I use data.table (although in this case I would, exceptionally, prefer a data.frame option) and I find my solution inelegant:
library(data.table)

func = function(df)
{
  # no height at all: keep the first row of the group
  if(all(is.na(subset(df, select=c(height1,height2)))))
    return(df[1,])
  # a height1 value exists: keep that row
  if(any(!is.na(df$height1)))
    return(df[!is.na(df$height1),])
  # otherwise keep the first row with a height2 value
  df[!is.na(df$height2),][1,]
}
setDT(df)
new.df = df[, func(.SD), by=list(col1,col2)]
new.df = data.frame(new.df)
new.df$height = ifelse(is.na(new.df$height1), new.df$height2, new.df$height1)
#> new.df
# col1 col2 height1 height2 col3 height
#1 a a 32 31.5 2 32
#2 a b NA 11.0 5 11
#3 b b NA 12.0 6 12
#4 c a 25 NA 8 25
#5 d a NA NA 9 NA