Consider the following:
df <- data.frame(X = c(5000, 6000, 5500, 5000, 5300))
count_above <- function(vector)
{
  counts <- vector()
  counts[1] <- 0
  for (i in 2:length(vector))
  {
    temp <- vector[1:i]
    counts <- c(counts, sum(temp < vector[i]))
  }
  return(counts)
}
This gives me the correct output:
count_above(df$X)
[1] 0 1 1 0 2
For instance, the (column) vector here is
5000
6000
5500
5000
5300
At the very top 5000, there are no values above it, so this gives value 0.
At the 6000, there is one value above it which is less than 6000: the 5000. So this gives value 1.
At the 5500, there are two values above it, one of which is less than 5500, so this gives value 1, and so forth.
Is there any way I can write this out without using a loop?
Another approach, quite similar to aichao's solution (but a bit shorter):
X <- c(5000, 6000, 5500, 5000, 5300)
indices <- 1:length(X)
count_above <- colSums(outer(X, X, "<") & outer(indices, indices, "<"))
## [1] 0 1 1 0 2
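To see why this works, it helps to inspect the two intermediate matrices separately. A short sketch using the sample data (the names value_smaller and comes_before are mine, added for clarity):

X <- c(5000, 6000, 5500, 5000, 5300)
idx <- seq_along(X)
value_smaller <- outer(X, X, "<")     # TRUE where X[i] < X[j]
comes_before  <- outer(idx, idx, "<") # TRUE where position i lies above position j
colSums(value_smaller & comes_before) # count smaller earlier values per position
## [1] 0 1 1 0 2

Since comes_before is just the strict upper triangle, upper.tri(value_smaller) would work equally well.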
Edit (Performance): Perhaps my idea was selected as the accepted answer because it is short and self-explanatory - but be careful when using it on large vectors! It is the slowest of all the solutions suggested here! Similar to what dracodoc did, I also ran a microbenchmark, but I used a randomly generated vector of 3000 values to get more significant run times:
count_above_loop <- function(v)
{
  counts <- integer(length = length(v))
  counts[1] <- 0
  for (i in 2:length(v))
  {
    counts[i] <- sum(v[1:(i-1)] < v[i])
  }
  return(counts)
}

count_above_outer <- function(X) {
  indices <- 1:length(X)
  colSums(outer(X, X, "<") & outer(indices, indices, "<"))
}

count_above_apply <- function(X) {
  sapply(seq_len(length(X)), function(i) sum(X[i:1] < X[i]))
}
X <- runif(3000)
microbenchmark::microbenchmark(count_above_loop(X),
                               count_above_apply(X),
                               count_above_outer(X), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval cld
count_above_loop(X) 56.27923 58.17195 62.07571 60.08123 63.92010 77.31658 10 a
count_above_apply(X) 54.41776 55.07511 57.12006 57.22372 58.61982 59.95037 10 a
count_above_outer(X) 121.12352 125.56072 132.45728 130.08141 137.08873 154.28419 10 b
We see that, on a large vector and without the overhead of a data frame, the apply approach is slightly faster than the for-loop.
My outer-product approach takes more than double the time.
So I would recommend using the for-loop - it is readable and fast. My approach might be considered if you want provably correct code (as this one-liner is quite close to a specification of the problem).
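To back up the correctness point, here is a minimal consistency check (my sketch, assuming the three functions defined above) verifying that all implementations agree on random input:

# sketch: all three implementations should agree on random input
set.seed(1)
v <- runif(100)
res_loop  <- count_above_loop(v)
res_apply <- count_above_apply(v)
res_outer <- count_above_outer(v)  # colSums() returns doubles, so compare with ==
stopifnot(all(res_loop == res_apply), all(res_loop == res_outer))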
Consider a running conditional count with sapply(). Though this still loops internally, it avoids an explicit for loop:
count_above <- sapply(seq_len(nrow(df)),
                      function(i) sum(df[i:1, c("X")] < df$X[i]))
count_above
# [1] 0 1 1 0 2
EDIT: I should have used a bigger dataset for the benchmark; the tiny dataset makes the results a little misleading. See PatrickRoocks's update.
I had just commented that a for loop is not necessarily worse than the apply family when I saw this question.
I ran a microbenchmark comparing an optimized for loop with the sapply method: the for loop is about 6 times faster. The sapply version is not a proper function; rewriting it as a function that takes a vector instead of assuming data frame columns improves it a little.
df <- data.frame(X = c(5000, 6000, 5500, 5000, 5300))

count_above <- function(v)
{
  counts <- integer(length = length(v))
  counts[1] <- 0
  for (i in 2:length(v))
  {
    counts[i] <- sum(v[1:(i-1)] < v[i])
  }
  return(counts)
}
count_above(df$X)
microbenchmark::microbenchmark(
  count_above(df$X),
  sapply(seq_len(nrow(df)), function(i) sum(df[i:1, c("X")] < df$X[i])),
  times = 10)
Unit: microseconds
expr
count_above(df$X)
sapply(seq_len(nrow(df)), function(i) sum(df[i:1, c("X")] < df$X[i]))
min lq mean median uq max neval cld
38.623 41.068 65.0722 55.0010 65.512 142.757 10 a
262.045 269.379 368.6231 339.2905 415.067 640.934 10 b
Update:
# modify Parfait's answer into a function taking a vector instead of a data frame
count_above_2 <- function(v) {
  counts <- sapply(seq_len(length(v)),
                   function(i) sum(v[i:1] < v[i]))
  return(counts)
}
X <- df$X
microbenchmark::microbenchmark(
  count_above(X),
  count_above_2(X),
  {indices <- 1:length(X); colSums(outer(X, X, "<") & outer(indices, indices, "<"))},
  times = 100)
Unit: microseconds
expr
count_above(X)
count_above_2(X)
{ indices <- 1:length(X) colSums(outer(X, X, "<") & outer(indices, indices, "<")) }
min lq mean median uq max neval cld
21.023 23.4680 39.02878 26.1565 35.4450 144.224 100 a
41.067 49.3785 67.06162 53.2900 70.1565 166.712 100 b
37.646 40.0900 66.45059 53.0450 72.8455 258.623 100 b
The for loop still wins.
Passing a plain vector instead of df$X saves time for all approaches, so I gave all three solutions the same vector to keep the comparison fair.
Parfait's answer is comparable with PatrickRoocks's.
Besides performance, there is a subtle point of correctness.
The OP's function and Parfait's sum(v[i:1] < v[i]) give the correct answer only because v[i] < v[i] is FALSE. By definition the comparison should use v[1:(i-1)] < v[i].
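The subtlety is easy to demonstrate: including index i in the range is harmless only because the comparison is strict. A small illustration of my own - with a non-strict comparison the two ranges would disagree:

v <- c(5000, 6000, 5500, 5000, 5300)
i <- 4
sum(v[i:1] < v[i])      # 0: v[i] < v[i] is FALSE, so including i changes nothing
sum(v[1:(i-1)] < v[i])  # 0: same result
sum(v[i:1] <= v[i])     # 2: v[i] <= v[i] is TRUE, v[4] counts itself and v[1]
sum(v[1:(i-1)] <= v[i]) # 1: only v[1]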
My function can be written in a more concise version like this:
count_above <- function(v)
{
  counts <- integer(length = length(v))
  for (i in 1:length(v))
  {
    counts[i] <- sum(v[1:(i-1)] < v[i])
  }
  return(counts)
}
It looks better and gives the correct result. This also depends on v[1] < v[1] being FALSE. That is not necessarily wrong, since it only concerns the first element, though I would still prefer the longer but more explicit version.
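To make the edge case explicit: for i = 1 the range 1:(i-1) is 1:0, which counts down to c(1, 0); R silently drops the zero index, so the comparison reduces to v[1] < v[1]. A short demonstration (mine, for clarity):

v <- c(5000, 6000, 5500, 5000, 5300)
1:0                # counts down: 1 0
v[1:0]             # index 0 is dropped, leaving v[1]: 5000
sum(v[1:0] < v[1]) # sum(5000 < 5000) = 0, the desired first count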
Another approach (still a loop, because colSums loops internally):
xg <- expand.grid(df$X, df$X)
o <- matrix(xg$Var1 < xg$Var2, nrow = nrow(df))
o[lower.tri(o)] <- FALSE
count_above <- colSums(o)
##[1] 0 1 1 0 2
This will most likely not be as efficient as Parfait's answer, but it is an alternative.
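As a side note, expand.grid builds the same comparison matrix that outer(X, X, "<") produces directly in PatrickRoocks's answer, so an equivalent sketch (my shortening, not part of the original answer) would be:

o <- outer(df$X, df$X, "<")  # same matrix as the expand.grid construction
o[lower.tri(o)] <- FALSE     # keep only comparisons against earlier rows
colSums(o)
## [1] 0 1 1 0 2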