Efficient Way to Incrementally Count Unique Data Points in Data Frame

Tags: dataframe, r, count

I am trying to find a more efficient way to incrementally count unique data points in a data frame.

For example, I have written the following code:

df = matrix(c(1, 2, 3, 3, 4, 5, 1, 2, 4, 4))
count = matrix(nrow = nrow(df), ncol = 1)

for (i in 1:nrow(df)) {
  count[i, 1] = length(which(df[1:i, 1] == df[i, 1]))
}

The purpose of the code is to count each occurrence of a value incrementally; for the sample above, the count column holds:

1, 1, 1, 2, 1, 1, 2, 2, 2, 3

The code I have written so far does the job, but the sample df above contains only 10 values. The real data frame I need to process contains 52,118 values, and the loop takes an enormous amount of time.

Does anyone know of a more efficient way to execute the code above?

asked May 14 '15 by Chintan Desai


3 Answers

A data.table solution:

library(data.table)
set.seed(20)
dat <- data.frame(values = sample(1:3, 50000, replace = TRUE))
setDT(dat)[, runningCount := 1:.N, values]

       values runningCount
    1:      3            1
    2:      3            2
    3:      1            1
    4:      2            1
    5:      3            3
   ---                    
49996:      1        16674
49997:      2        16516
49998:      2        16517
49999:      2        16518
50000:      2        16519
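The 1:.N form works here; seq_len(.N) is the generally recommended data.table idiom for the same per-group counter. A minimal sketch on the same simulated data:

```r
library(data.table)

set.seed(20)
dat <- data.table(values = sample(1:3, 50000, replace = TRUE))

# seq_len(.N) generates 1, 2, ..., group size within each by-group,
# in the original row order of that group
dat[, runningCount := seq_len(.N), by = values]
```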
answered Nov 16 '22 by user227710


One base R approach:

Reduce(`+`, lapply(unique(c(df)), function(u) {
  b <- c(df) == u
  b[b] <- cumsum(b[b])
  b
}))
#[1] 1 1 1 2 1 1 2 2 2 3
answered Nov 16 '22 by Colonel Beauvel


Here's a fast approach with the dplyr package:

library(dplyr)

# Fake data
set.seed(20)
dat = data.frame(values = sample(1:3, 50000, replace = TRUE))

dat %>% group_by(values) %>%
  mutate(runningCount = 1:n())

   values runningCount
1       2            1
2       3            1
3       1            1
4       3            2
5       1            2
6       3            3
7       3            4
..    ...          ...
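The same grouped counter can also be written with row_number(), which dplyr recommends over 1:n() inside mutate(); a sketch on the same fake data:

```r
library(dplyr)

set.seed(20)
dat <- data.frame(values = sample(1:3, 50000, replace = TRUE))

# row_number() yields 1, 2, ... within each group, preserving row order
dat <- dat %>%
  group_by(values) %>%
  mutate(runningCount = row_number()) %>%
  ungroup()
```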

Timing of the dplyr approach (in milliseconds, microbenchmark summary):

     min       lq     mean   median       uq      max  neval
2.003755 2.134762 2.198161 2.186214 2.231662 3.665328    100

Timings for all answers so far (using the data I created):

                median
        dplyr:   2.11
   data.table:   1.24
lapply/Reduce:  11.61
          ave:   9.93

So data.table is the fastest.
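The ave entry in the table above refers to a base R approach along these lines (a sketch; ave applies a function within each group and returns the results in the original row order):

```r
# the asker's sample data, as a plain vector
x <- c(1, 2, 3, 3, 4, 5, 1, 2, 4, 4)

# cumulative count of each value's occurrences, in original row order
count <- ave(rep(1, length(x)), x, FUN = cumsum)
count
# [1] 1 1 1 2 1 1 2 2 2 3
```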

answered Nov 16 '22 by eipi10