
Cumulative count of unique values in R

A simplified version of my data set would look like:

depth value
   1     a
   1     b
   2     a
   2     b
   2     b
   3     c

I would like to make a new data set that gives, for each value of "depth", the cumulative number of unique values seen so far, starting from the top. For example:

depth cumsum
 1      2
 2      2
 3      3

Any ideas as to how to do this? I am relatively new to R.

asked Mar 29 '13 06:03 by user2223405


1 Answer

This is a good case for using factor with carefully chosen levels. I'll use data.table here to implement the idea. It helps if your value column is character, though that is not an absolute requirement.

  • step 1: Keep only the unique rows of your data.frame and convert the result to a data.table.

    require(data.table)
    dt <- as.data.table(unique(df))
    setkey(dt, "depth") # just to be sure before factoring "value"
    
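  • step 2: Convert value to a factor and coerce it to numeric. Set the levels yourself with unique(value) so that the codes follow the order of first appearance; by default factor() sorts the levels, and the codes would then not reflect that order.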

    dt[, id := as.numeric(factor(value, levels = unique(value)))]
    
  • step 3: Set the key to depth and id, then for each depth pick the last row, which holds the largest id at that depth.

    setkey(dt, "depth", "id")
    dt.out <- dt[J(unique(depth)), mult="last"][, value := NULL]
    
    #    depth id
    # 1:     1  2
    # 2:     2  2
    # 3:     3  3
    
  • step 4: The cumulative count can never decrease as depth increases, so each row must hold at least the value of the previous row. Use cummax to enforce this and get the final output.

    dt.out[, id := cummax(id)]
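
To see why the explicit levels in step 2 matter, here is a minimal base R sketch on a made-up vector (the vector `value` and the variable names below are illustrative, not the question's data). With default levels the codes follow sorted order; with levels = unique(value) they follow the order of first appearance, so the running maximum of the codes equals the number of distinct values seen so far.

```r
# Illustrative vector (hypothetical, not from the question)
value <- c("b", "a", "b", "c")

# Default levels are sorted, so the codes do not reflect order of appearance
codes_default <- as.numeric(factor(value))

# Levels in order of first appearance: each new value gets the next integer
codes_ordered <- as.numeric(factor(value, levels = unique(value)))

# Running maximum of the ordered codes = distinct values seen so far
running_distinct <- cummax(codes_ordered)
print(running_distinct)  # 1 2 2 3
```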
    

Edit: The above code was for illustrative purposes. In reality you don't need a 3rd column at all. This is how I'd write the final code.

require(data.table)
dt <- as.data.table(unique(df))
setkey(dt, "depth")
dt[, value := as.numeric(factor(value, levels = unique(value)))]
setkey(dt, "depth", "value")
dt.out <- dt[J(unique(depth)), mult="last"]
dt.out[, value := cummax(value)]
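
For comparison, the same result can be sketched in base R without factors. This is a different technique from the data.table code above, not a restatement of it: it flags the first occurrence of each value with duplicated(), counts those flags per depth, and takes the cumulative sum. It assumes a data frame with columns `depth` and `value` as in the question; the names `first_seen`, `new_per_depth` and `out` are made up for this example.

```r
# The question's sample data
df <- data.frame(depth = c(1, 1, 2, 2, 2, 3),
                 value = c("a", "b", "a", "b", "b", "c"))

# Sort by depth so that "first occurrence" means "shallowest occurrence"
df <- df[order(df$depth), ]

# TRUE where a value appears for the first time
first_seen <- !duplicated(df$value)

# Number of newly seen values at each depth, then accumulate over depths
new_per_depth <- tapply(first_seen, df$depth, sum)
out <- data.frame(depth  = as.numeric(names(new_per_depth)),
                  cumsum = as.vector(cumsum(new_per_depth)))
print(out)
```

On the sample data this gives a cumulative count of 2, 2, 3 for depths 1, 2, 3, matching the output requested in the question.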

Here's a trickier example and the output from the code:

df <- structure(list(depth = c(1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6), 
                value = structure(c(1L, 2L, 3L, 4L, 1L, 3L, 4L, 5L, 6L, 1L, 1L), 
                .Label = c("a", "b", "c", "d", "f", "g"), class = "factor")), 
                .Names = c("depth", "value"), row.names = c(NA, -11L), 
                class = "data.frame")
# dt.out after running the final code above on this df:
#    depth value
# 1:     1     2
# 2:     2     4
# 3:     3     4
# 4:     4     5
# 5:     5     6
# 6:     6     6
answered Oct 28 '22 22:10 by Arun