I wish to count the number of unique values by grouping of a second variable, and then add the count to the existing data.frame as a new column. For example, if the existing data frame looks like this:
color type 1 black chair 2 black chair 3 black sofa 4 green sofa 5 green sofa 6 red sofa 7 red plate 8 blue sofa 9 blue plate 10 blue chair
I want to add for each color
, the count of unique types
that are present in the data:
color type unique_types 1 black chair 2 2 black chair 2 3 black sofa 2 4 green sofa 1 5 green sofa 1 6 red sofa 2 7 red plate 2 8 blue sofa 3 9 blue plate 3 10 blue chair 3
I was hoping to use ave
, but can't seem to find a straightforward method that doesn't require many lines. I have >100,000 rows, so am also not sure how important efficiency is.
It's somewhat similar to this issue: Count number of observations/rows per group and add result to data frame
read_csv() function in which pass the path and name of the dataset. Select the column in which you want to check or count the unique values. For finding unique values we are using unique() function provided by pandas and stored it in a variable, let named as 'unique_values'.
Here's a solution with the dplyr package - it has n_distinct()
as a wrapper for length(unique())
.
df %>% group_by(color) %>% mutate(unique_types = n_distinct(type))
Using ave
(since you ask for it specifically):
within(df, { count <- ave(type, color, FUN=function(x) length(unique(x)))})
Make sure that type
is character vector and not factor.
Since you also say your data is huge and that speed/performance may therefore be a factor, I'd suggest a data.table
solution as well.
require(data.table) setDT(df)[, count := uniqueN(type), by = color] # v1.9.6+ # if you don't want df to be modified by reference ans = as.data.table(df)[, count := uniqueN(type), by = color]
uniqueN
was implemented in v1.9.6
and is a faster equivalent of length(unique(.))
. In addition it also works with data.frames/data.tables.
Other solutions:
Using plyr:
require(plyr) ddply(df, .(color), mutate, count = length(unique(type)))
Using aggregate
:
agg <- aggregate(data=df, type ~ color, function(x) length(unique(x))) merge(df, agg, by="color", all=TRUE)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With