I am trying to find a more efficient way to incrementally count unique data points in a data frame.
For example, I have written the following code:
df <- matrix(c(1, 2, 3, 3, 4, 5, 1, 2, 4, 4))
count <- matrix(nrow = nrow(df), ncol = 1)
for (i in 1:nrow(df)) {
  count[i, 1] <- length(which(df[1:i, 1] == df[i, 1]))
}
The purpose of the code is to count each occurrence of a value incrementally as it appears, so the count column ends up as:
1, 1, 1, 2, 1, 1, 2, 2, 2, 3.
The code I have written so far does the job, but the sample df above only contains 10 values. The real data frame I need to run this on contains 52,118 values, and because which() rescans all preceding rows at every iteration, the work grows quadratically and the loop takes an enormous amount of time.
Does anyone know of a more efficient way to execute the code above?
A data.table solution:
library(data.table)
set.seed(20)
dat <- data.frame(values = sample(1:3, 50000, replace = TRUE))
setDT(dat)[, runningCount := 1:.N, by = values]
values runningCount
1: 3 1
2: 3 2
3: 1 1
4: 2 1
5: 3 3
---
49996: 1 16674
49997: 2 16516
49998: 2 16517
49999: 2 16518
50000: 2 16519
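Applied back to the 10 values from the question (a quick sanity check; the column name values is my choice, not from the question), this reproduces the expected counts:

dt <- data.table(values = c(1, 2, 3, 3, 4, 5, 1, 2, 4, 4))
dt[, runningCount := seq_len(.N), by = values]  # seq_len(.N) numbers rows within each group
dt$runningCount
# [1] 1 1 1 2 1 1 2 2 2 3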
One base R approach:
Reduce(`+`, lapply(unique(c(df)), function(u) { b = c(df) == u; b[b == T] = cumsum(b[b == T]); b }))
#[1] 1 1 1 2 1 1 2 2 2 3
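Since the one-liner is dense, here is an equivalent expanded form with comments (same logic, just spread out; b[b] replaces b[b == T], which is identical for a logical vector without NAs):

counts_per_value <- lapply(unique(c(df)), function(u) {
  b <- c(df) == u       # logical mask: TRUE where the vector equals u
  b[b] <- cumsum(b[b])  # replace each TRUE with its running count (coerces b to numeric)
  b                     # zeros everywhere else
})
Reduce(`+`, counts_per_value)  # each position is nonzero in exactly one mask, so summing merges them
# [1] 1 1 1 2 1 1 2 2 2 3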
Here's a fast approach with the dplyr package:
library(dplyr)
# Fake data
set.seed(20)
dat = data.frame(values = sample(1:3, 50000, replace=TRUE))
dat %>% group_by(values) %>%
mutate(runningCount = 1:n())
values runningCount
1 2 1
2 3 1
3 1 1
4 3 2
5 1 2
6 3 3
7 3 4
.. ... ...
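In current dplyr, row_number() is the idiomatic equivalent of 1:n() inside a grouped mutate:

dat %>%
  group_by(values) %>%
  mutate(runningCount = row_number())  # row index within each group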
Timing (in milliseconds):
min lq mean median uq max neval
2.003755 2.134762 2.198161 2.186214 2.231662 3.665328 100
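Output in that format comes from the microbenchmark package; the original benchmark call isn't shown, but something along these lines would produce it:

library(microbenchmark)
microbenchmark(
  dplyr = dat %>% group_by(values) %>% mutate(runningCount = 1:n()),
  times = 100  # 100 evaluations, matching neval above
)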
Timings for all answers so far (median, in milliseconds, using the data I created):

                median
dplyr:            2.11
data.table:       1.24
lapply/Reduce:   11.61
ave:              9.93
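The ave entry refers to a base R answer not reproduced above; the standard idiom for this task, which that timing most likely corresponds to, is:

ave(df[, 1], df[, 1], FUN = seq_along)  # seq_along numbers the elements within each group
# [1] 1 1 1 2 1 1 2 2 2 3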
So data.table is the fastest.