Efficient Way to Incrementally Count Unique Data Points in Data Frame

Tags: dataframe, r, count

I am trying to find a more efficient way to incrementally count unique data points in a data frame.

For example, I have written the following code:

df = matrix(c(1, 2, 3, 3, 4, 5, 1, 2, 4, 4))
count = matrix(nrow = nrow(df), ncol = 1)

for (i in 1:nrow(df)) {
  count[i, 1] = length(which(df[1:i, 1] == df[i, 1]))
}

The purpose of the code is to count each occurrence of a value incrementally; for the sample above, the count column holds:

1, 1, 1, 2, 1, 1, 2, 2, 2, 3

The code I have written so far does the job, but the sample df above contains only 10 values. The real data frame I need to process contains 52,118 values, and the loop takes an enormous amount of time.

Does anyone know of a more efficient way to execute the code above?

asked May 14 '15 by Chintan Desai


3 Answers

A data.table solution:

library(data.table)
set.seed(20)
dat <- data.frame(values = sample(1:3, 50000, replace = TRUE))
setDT(dat)[, runningCount := 1:.N, values]

       values runningCount
    1:      3            1
    2:      3            2
    3:      1            1
    4:      2            1
    5:      3            3
   ---                    
49996:      1        16674
49997:      2        16516
49998:      2        16517
49999:      2        16518
50000:      2        16519
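The 1:.N form works here; seq_len(.N) is the generally recommended data.table idiom for the same per-group counter. A minimal sketch on the same simulated data:

```r
library(data.table)

set.seed(20)
dat <- data.table(values = sample(1:3, 50000, replace = TRUE))

# seq_len(.N) generates 1, 2, ..., group size within each by-group,
# in the original row order of that group
dat[, runningCount := seq_len(.N), by = values]
```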
answered Nov 16 '22 by user227710


One base R approach:

Reduce(`+`, lapply(unique(c(df)), function(u) {
  b <- c(df) == u
  b[b] <- cumsum(b[b])
  b
}))
#[1] 1 1 1 2 1 1 2 2 2 3
answered Nov 16 '22 by Colonel Beauvel


Here's a fast approach with the dplyr package:

library(dplyr)

# Fake data
set.seed(20)
dat = data.frame(values = sample(1:3, 50000, replace = TRUE))

dat %>% group_by(values) %>%
  mutate(runningCount = 1:n())

   values runningCount
1       2            1
2       3            1
3       1            1
4       3            2
5       1            2
6       3            3
7       3            4
..    ...          ...
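The same grouped counter can also be written with row_number(), which dplyr recommends over 1:n() inside mutate(); a sketch on the same fake data:

```r
library(dplyr)

set.seed(20)
dat <- data.frame(values = sample(1:3, 50000, replace = TRUE))

# row_number() yields 1, 2, ... within each group, preserving row order
dat <- dat %>%
  group_by(values) %>%
  mutate(runningCount = row_number()) %>%
  ungroup()
```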

Timing of the dplyr approach (in milliseconds, microbenchmark summary):

     min       lq     mean   median       uq      max  neval
2.003755 2.134762 2.198161 2.186214 2.231662 3.665328    100

Timings for all answers so far (using the data I created):

                median
        dplyr:   2.11
   data.table:   1.24
lapply/Reduce:  11.61
          ave:   9.93

So data.table is the fastest.
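The ave entry in the table above refers to a base R approach along these lines (a sketch; ave applies a function within each group and returns the results in the original row order):

```r
# the asker's sample data, as a plain vector
x <- c(1, 2, 3, 3, 4, 5, 1, 2, 4, 4)

# cumulative count of each value's occurrences, in original row order
count <- ave(rep(1, length(x)), x, FUN = cumsum)
count
# [1] 1 1 1 2 1 1 2 2 2 3
```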

answered Nov 16 '22 by eipi10