 

Add count of unique / distinct values by group to the original data

I wish to count the number of unique values of one variable within groups defined by a second variable, and then add that count to the existing data.frame as a new column. For example, if the existing data frame looks like this:

       color  type
    1  black chair
    2  black chair
    3  black  sofa
    4  green  sofa
    5  green  sofa
    6    red  sofa
    7    red plate
    8   blue  sofa
    9   blue plate
    10  blue chair

I want to add, for each color, the count of unique types present in the data:

       color  type unique_types
    1  black chair            2
    2  black chair            2
    3  black  sofa            2
    4  green  sofa            1
    5  green  sofa            1
    6    red  sofa            2
    7    red plate            2
    8   blue  sofa            3
    9   blue plate            3
    10  blue chair            3

I was hoping to use ave, but can't seem to find a straightforward method that doesn't require many lines. I have >100,000 rows, so I'm also not sure how much efficiency matters.

It's somewhat similar to this issue: Count number of observations/rows per group and add result to data frame
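For anyone reproducing this, the example data can be rebuilt roughly as follows (a minimal sketch; the construction is my assumption, only the values come from the printout above):

    df <- data.frame(
      color = c("black", "black", "black", "green", "green",
                "red", "red", "blue", "blue", "blue"),
      type  = c("chair", "chair", "sofa", "sofa", "sofa",
                "sofa", "plate", "sofa", "plate", "chair"),
      stringsAsFactors = FALSE  # keep type as character; only needed on R < 4.0
    )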

Asked Jul 02 '13 by Bryan



2 Answers

Here's a solution with the dplyr package - it has n_distinct() as a wrapper for length(unique()).

    library(dplyr)

    df %>%
      group_by(color) %>%
      mutate(unique_types = n_distinct(type))
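If you are on dplyr 1.1.0 or newer (an assumption about your installed version, not something the answer requires), the per-operation .by argument gives the same result without leaving the data grouped:

    library(dplyr)

    # grouping applies only to this mutate(); the result comes back ungrouped
    df %>% mutate(unique_types = n_distinct(type), .by = color)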
Answered by Sam Firke


Using ave (since you asked for it specifically):

    within(df, {
      count <- ave(type, color, FUN = function(x) length(unique(x)))
    })

Make sure that type is a character vector and not a factor.
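Because ave() returns a vector of the same mode as its input, the count produced above comes back as character. A small sketch of my own (not part of the original answer) that coerces to a numeric column and also tolerates a factor type:

    df$count <- as.numeric(
      ave(as.character(df$type), df$color,
          FUN = function(x) length(unique(x)))
    )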


Since you also say your data is huge and that speed/performance may therefore be a factor, I'd suggest a data.table solution as well.

    require(data.table)
    setDT(df)[, count := uniqueN(type), by = color]   # v1.9.6+

    # if you don't want df to be modified by reference
    ans = as.data.table(df)[, count := uniqueN(type), by = color]

uniqueN was implemented in v1.9.6 and is a faster equivalent of length(unique(.)). It also works directly on data.frames/data.tables (counting unique rows).
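If your installed data.table is older than 1.9.6 (an assumption worth checking with packageVersion("data.table")), the length(unique(.)) form mentioned above drops in unchanged:

    require(data.table)
    # same result as uniqueN(), just slower on recent versions
    setDT(df)[, count := length(unique(type)), by = color]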


Other solutions:

Using plyr:

    require(plyr)
    ddply(df, .(color), mutate, count = length(unique(type)))

Using aggregate:

    agg <- aggregate(type ~ color, data = df, FUN = function(x) length(unique(x)))
    merge(df, agg, by = "color", all = TRUE)
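Two small caveats here (my notes, not part of the original answer): the aggregated column keeps the name type, so the merge produces type.x/type.y, and merge() sorts the result by the by column, losing the original row order. A sketch that renames the count and restores the order via a temporary row id:

    df$row_id <- seq_len(nrow(df))                      # remember the original order
    agg <- aggregate(type ~ color, data = df,
                     FUN = function(x) length(unique(x)))
    names(agg)[2] <- "unique_types"                     # avoid clashing with "type"
    res <- merge(df, agg, by = "color", all.x = TRUE)
    res <- res[order(res$row_id), ]                     # back to the original row order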
Answered by Arun