How can I efficiently extract group-wise constant columns from a data frame? I've included a plyr implementation below to make precise what I'm trying to do, but it's slow. How can I do it as efficiently as possible (ideally without splitting the data frame at all)?
library(plyr)

base <- data.frame(group = 1:1000, a = sample(1000), b = sample(1000))
df <- data.frame(
  base[rep(seq_len(nrow(base)), length.out = 1e6), ],
  c = runif(1e6),
  d = runif(1e6)
)

# A column is constant if it holds a single distinct value.
is.constant <- function(x) length(unique(x)) == 1
# Keep only the constant columns of one group, reduced to a single row.
constant_cols <- function(x) head(Filter(is.constant, x), 1)

system.time(constant <- ddply(df, "group", constant_cols))
#   user  system elapsed
# 20.531   1.670  22.378

stopifnot(identical(names(constant), c("group", "a", "b")))
stopifnot(nrow(constant) == 1000)
In my real use case (deep inside ggplot2) there may be an arbitrary number of constant and non-constant columns. The size of the data in the example is about the right order of magnitude.
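To make the target concrete, here is a small sketch on a hypothetical four-row data frame (the name `toy` and its values are made up for illustration, not taken from the real data). It checks each non-group column group by group in base R. It still splits by group through `tapply`, so it only illustrates the definition and is not meant as a fast solution.

```r
# Same definition of "constant" as in the question.
is.constant <- function(x) length(unique(x)) == 1

# Hypothetical toy data: `a` is constant within each group, `c` is not.
toy <- data.frame(
  group = c(1, 1, 2, 2),
  a     = c(5, 5, 7, 7),
  c     = c(0.1, 0.9, 0.4, 0.4)
)

# For each candidate column, test whether it is constant in every group.
res <- sapply(toy[c("a", "c")], function(col) {
  all(tapply(col, toy$group, is.constant))
})
res
#     a     c
#  TRUE FALSE
```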
(Edited to possibly address the issue of consecutive groups with the same value.)
I'm tentatively submitting this answer, but I haven't completely convinced myself that it will correctly identify within-group constant columns in all cases. It's definitely faster, though, and can probably be improved:
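constant_cols1 <- function(df, grp) {
  # Assumes grp is the first column and every column is numeric.
  df <- df[order(df[, grp]), ]
  # Adjust values based on max diff in data: shift every other group by an
  # offset, intended to keep equal values in consecutive groups from merging
  # into a single run.
  rle_group <- rle(df[, grp])
  vec <- rep(rep(c(0, ceiling(diff(range(df)))),
                 length.out = length(rle_group$lengths)),
             times = rle_group$lengths)
  m <- matrix(vec, nrow = length(vec), ncol = ncol(df) - 1)
  df_new <- df
  df_new[, -1] <- df[, -1] + m
  # A column counts as constant within groups when its run lengths match
  # the run lengths of the grouping column.
  rles <- lapply(df_new, FUN = rle)
  nms <- names(rles)
  tmp <- sapply(rles[nms != grp],
                FUN = function(x) identical(x$lengths, rles[[grp]]$lengths))
  return(tmp)
}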
My basic idea was to use rle, obviously.
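A related way to express the run-length idea, sketched here as an untested-at-scale alternative rather than a replacement for the function above: after sorting by group, a column is constant within groups exactly when its value never changes between two neighbouring rows that share a group. The function name `constant_cols_sorted` is my own, and the sketch assumes there are no missing values in the columns being tested. I have not benchmarked it against the plyr version.

```r
# Sketch only: assumes no missing values in the columns being tested.
constant_cols_sorted <- function(df, grp) {
  df <- df[order(df[[grp]]), , drop = FALSE]
  g <- df[[grp]]
  n <- nrow(df)
  # TRUE where a row and the row before it belong to the same group.
  same_group <- g[-1] == g[-n]
  others <- setdiff(names(df), grp)
  is_const <- vapply(others, function(nm) {
    x <- df[[nm]]
    # Constant within groups: no value change between same-group neighbours.
    !any(same_group & x[-1] != x[-n])
  }, logical(1))
  # Keep the first row of each group, restricted to the constant columns.
  out <- df[!duplicated(g), c(grp, others[is_const]), drop = FALSE]
  rownames(out) <- NULL
  out
}

# Small usage example with the same shape as the data in the question.
set.seed(1)
small_base <- data.frame(group = 1:10, a = sample(10), b = sample(10))
small_df <- data.frame(
  small_base[rep(seq_len(nrow(small_base)), length.out = 100), ],
  c = runif(100),
  d = runif(100)
)
small_constant <- constant_cols_sorted(small_df, "group")
```

Unlike the offset trick, this compares neighbouring rows directly, so it does not depend on the range of the data or on the grouping column being first.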