Consider the following:
df <- data.frame(X = c(5000, 6000, 5500, 5000, 5300))
count_above <- function(vector)
{
  counts <- vector()
  counts[1] <- 0
  for (i in 2:length(vector))
  {
    temp <- vector[1:i]
    counts <- c(counts, sum(temp < vector[i]))
  }
  return(counts)
}
This gives me the correct output:
count_above(df$X)
[1] 0 1 1 0 2
For instance, the (column) vector here is
5000
6000
5500
5000
5300
At the first 5000, at the very top, there are no values above it, so this gives 0.
At the 6000, there is one value above it, and it is less than 6000 (the 5000), so this gives 1.
At the 5500, there are two values above it, one of which (the first 5000) is less than 5500, so this gives 1, and so forth.
Is there any way I can write this out without using a loop?
Another approach, quite similar to aichao's solution, but a bit shorter:
X <- c(5000, 6000, 5500, 5000, 5300)
indices <- 1:length(X)
count_above <- colSums(outer(X, X, "<") & outer(indices, indices, "<"))
## [1] 0 1 1 0 2
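To see why this works, it helps to look at the two intermediate matrices separately (a quick illustration; the names value_lt and index_lt are mine, not part of the answer):
value_lt <- outer(X, X, "<")              # value_lt[i, j] is TRUE when X[i] < X[j]
index_lt <- outer(indices, indices, "<")  # index_lt[i, j] is TRUE when i < j, i.e. X[i] sits above X[j]
colSums(value_lt & index_lt)              # column j counts earlier values smaller than X[j]
## [1] 0 1 1 0 2
Note that outer(indices, indices, "<") is just the strict upper triangle, so upper.tri(value_lt) would work equally well here.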
Edit (Performance): Perhaps my idea was selected as the accepted answer because it is short and self-explanatory code - but be careful about using it on large vectors! It is the slowest of all the approaches suggested here! Similar to what dracodoc did, I also ran a microbenchmark, but on a randomly generated vector of 3000 values to get more significant run times:
# pre-allocated for-loop (the baseline)
count_above_loop <- function(v)
{
  counts <- integer(length = length(v))
  counts[1] <- 0
  for (i in 2:length(v))
  {
    counts[i] <- sum(v[1:(i-1)] < v[i])
  }
  return(counts)
}
# the outer-product one-liner from this answer
count_above_outer <- function(X) {
  indices <- 1:length(X)
  colSums(outer(X, X, "<") & outer(indices, indices, "<"))
}
# Parfait's sapply approach, applied to a plain vector
count_above_apply <- function(X) {
  sapply(seq_len(length(X)), function(i) sum(X[i:1] < X[i]))
}
X <- runif(3000)
microbenchmark::microbenchmark(count_above_loop(X),
count_above_apply(X),
count_above_outer(X), times = 10)
Unit: milliseconds
                 expr       min        lq      mean    median        uq       max neval cld
  count_above_loop(X)  56.27923  58.17195  62.07571  60.08123  63.92010  77.31658    10  a
 count_above_apply(X)  54.41776  55.07511  57.12006  57.22372  58.61982  59.95037    10  a
 count_above_outer(X) 121.12352 125.56072 132.45728 130.08141 137.08873 154.28419    10   b
We see that, on a large vector and without the overhead of a data frame, the apply approach is slightly faster than the for-loop, while my outer-product approach takes more than double the time.
So I would recommend using the for-loop - it is readable and consistently fast. My approach might be considered if you want provably correct code, as this one-liner is quite close to a specification of the problem.
Consider a running conditional count with sapply(). Though this still loops under the hood, it reads as a compact, vectorized-style call:
count_above <- sapply(seq_len(nrow(df)),
                      function(i) sum(df[i:1, c("X")] < df$X[i]))
count_above
# [1] 0 1 1 0 2
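A small type-stable variant of the same idea (a sketch, not part of the original answer): vapply() declares the per-element result type and takes a plain vector instead of the data frame:
# vapply() variant of the running count; integer(1) fixes the result type
count_above_v <- function(v) {
  vapply(seq_along(v), function(i) sum(v[i:1] < v[i]), integer(1))
}
count_above_v(df$X)
## [1] 0 1 1 0 2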
EDIT: I should have used a bigger dataset for the benchmark; the tiny dataset makes the results a little misleading. See PatrickRoocks's update.
I had just commented that a for loop is not necessarily worse than the apply family when I saw this question.
So I ran a microbenchmark comparing an optimized for loop with the sapply method: the for loop is about 6 times faster. The sapply method is not wrapped in a proper function; turning it into a function that takes a vector, instead of assuming data frame columns, improves it a little.
df <- data.frame(X = c(5000, 6000, 5500, 5000, 5300))
count_above <- function(v)
{
  counts <- integer(length = length(v))
  counts[1] <- 0
  for (i in 2:length(v))
  {
    counts[i] <- sum(v[1:(i-1)] < v[i])
  }
  return(counts)
}
count_above(df$X)
microbenchmark::microbenchmark(
  count_above(df$X),
  sapply(seq_len(nrow(df)), function(i) sum(df[i:1, c("X")] < df$X[i])),
  times = 10)
Unit: microseconds
              expr     min      lq     mean   median      uq     max neval cld
 count_above(df$X)  38.623  41.068  65.0722  55.0010  65.512 142.757    10  a
       sapply(...) 262.045 269.379 368.6231 339.2905 415.067 640.934    10   b
Update:
# Parfait's answer turned into a function, taking a vector instead of a data frame
count_above_2 <- function(v){
  counts <- sapply(seq_len(length(v)),
                   function(i) sum(v[i:1] < v[i]))
  return(counts)
}
X <- df$X
microbenchmark::microbenchmark(
  count_above(X),
  count_above_2(X),
  {indices <- 1:length(X); colSums(outer(X, X, "<") & outer(indices, indices, "<"))},
  times = 100)
Unit: microseconds
                expr    min      lq     mean  median      uq     max neval cld
      count_above(X) 21.023 23.4680 39.02878 26.1565 35.4450 144.224   100  a
    count_above_2(X) 41.067 49.3785 67.06162 53.2900 70.1565 166.712   100   b
 colSums(outer(...)) 37.646 40.0900 66.45059 53.0450 72.8455 258.623   100   b
The for loop still wins.
Passing a plain vector instead of df$X saves time for every approach, so I gave all three solutions the same vector to keep them comparable.
Parfait's answer is comparable with PatrickRoocks's.
Besides performance, there is a subtle point of correctness.
The OP's function and Parfait's sum(v[i:1] < v[i]) give the correct answer only because v[i] < v[i] is FALSE; by the definition of the problem, the comparison should be v[1:(i-1)] < v[i].
My function can be written more concisely like this:
count_above <- function(v)
{
  counts <- integer(length = length(v))
  for (i in 1:length(v))
  {
    counts[i] <- sum(v[1:(i-1)] < v[i])
  }
  return(counts)
}
It looks better and gives the correct result. This also depends on v[1] < v[1] being FALSE: for i = 1 the index 1:(i-1) is 1:0, which evaluates to c(1, 0), and the zero index is dropped, so v[1] is compared against itself. It is not necessarily wrong, since it only affects the first element, but I would still prefer the longer, more explicit version.
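This is easy to verify in the console (a quick check, not part of the original answer):
1:0          # counts down: c(1, 0), not an empty sequence
## [1] 1 0
v <- c(5000, 6000, 5500)
v[1:0]       # the zero index is silently dropped
## [1] 5000
sum(v[1:0] < v[1])   # v[1] < v[1] is FALSE, so the count for i = 1 is 0
## [1] 0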
Another approach (still a loop because of colSums):
xg <- expand.grid(df$X, df$X)
o <- matrix(xg$Var1 < xg$Var2, nrow = length(df$X))  # o[i, j] is TRUE when X[i] < X[j]
o[lower.tri(o)] <- FALSE                             # drop pairs with i > j; the diagonal is FALSE anyway
count_above <- colSums(o)
##[1] 0 1 1 0 2
This will most likely not be as efficient as Parfait's answer, but it is an alternative.
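As a final sanity check, all the approaches in this thread can be compared on the example data (a quick verification sketch, assuming the functions from PatrickRoocks's benchmark are still in scope):
X <- c(5000, 6000, 5500, 5000, 5300)
expected <- c(0, 1, 1, 0, 2)
all(count_above_loop(X) == expected)   # TRUE
all(count_above_apply(X) == expected)  # TRUE
all(count_above_outer(X) == expected)  # TRUE
identical() is deliberately avoided here because the approaches return different types (integer from the loop and sapply versions, double from colSums()).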