Imagine that I have a list of numbers (i.e. numbers column in data.table/data.frame).
1
5
5
10
11
12
for each number in a list a want to count how many unique numbers are there which are lower than that particular number + 5.
The explanation for upper case, first number = 1, search range is 1+5 = 6, so three numbers are in range, less than or equal to: c(1,5,5)
, and then count unique is 2.
This is all by assuming we've got the additional condition, that the number must not only be lower than current_number + 5, but also its index in the list must be >= that of current_number.
The result in this case would be:
2
2
2
3
2
1
Note: Is there a fast solution for huge dataset, in data.frame or data.table? My dataset is rather huge, 10+M rows.
The fastest way I can think of in base R (works if x
is sorted):
findInterval(x + 5, unique(x)) - cumsum(!duplicated(x)) + 1L
#[1] 2 2 2 3 2 1
edit: no problem with the sorting because with data.table
, sorting integers is trivial:
nr <- 1e7
nn <- nr/2
set.seed(0L)
DT <- data.table(X=sample(nn, nr, TRUE))
#DT <- data.table(X=c(1,5,5,10,11,12))
system.time(
DT[order(X),
COUNT := findInterval(X + 5L, unique(X)) - cumsum(!duplicated(X)) + 1L
]
)
# user system elapsed
# 1.73 0.17 1.53
2s for 10million rows.
Try this:
x <- c(1,5,5,10,11,12)
sapply(seq_along(x), function(i)
sum(unique(x[i:length(x)]) <= (x[i] + 5)))
# [1] 2 2 2 3 2 1
One option is to use a sql self-join
library(sqldf)
df$r <- seq(nrow(df))
sqldf('
select a.V1
, count(distinct b.V1) as n
from df a
left join df b
on b.V1 <= a.V1 + 5
and b.r >= a.r
group by a.r
')
# V1 n
# 1 1 2
# 2 5 2
# 3 5 2
# 4 10 3
# 5 11 2
# 6 12 1
Data used:
df <- structure(list(V1 = c(1L, 5L, 5L, 10L, 11L, 12L)), row.names = c(NA,
-6L), class = "data.frame")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With