How to calculate count by group, then keep only one per group

Say that I have this data.frame, data:

data <- data.frame(val=c(rep(6,10), rep(7, 15), rep(8, 20), rep(9, 25), rep(10, 100), rep(11, 20), rep(12, 15), rep(13, 10)))
data$plus <- data$val + 100

My goal is to create a new data.frame that has the frequencies of each val, and the associated plus value.

My current strategy is to build a frequency table (called table), merge the frequencies back into data, and then keep only the first observation within each group:

table <- table(data$val)
df1 <- data.frame(val = as.integer(names(table)[1:length(table)]), N = table[1:length(table)])
df2 <- merge(data, df1)
df3 <- do.call(rbind, by(df2, list(df2$val), FUN=function(x) head(x, 1)))

This works, but it seems clunky.

In Stata, for example, this would take less, simpler code. Something like:

bys val plus: gen N = _N
bys val plus: gen first = _n==1
keep if first==1

Is there a way to simplify the R code or make it more elegant?

bill999 asked Dec 10 '22 23:12


2 Answers

Here's an approach using "data.table":

library(data.table)
as.data.table(data)[, N := .N, by = val][, .SD[1], by = val]
#    val plus   N
# 1:   6  106  10
# 2:   7  107  15
# 3:   8  108  20
# 4:   9  109  25
# 5:  10  110 100
# 6:  11  111  20
# 7:  12  112  15
# 8:  13  113  10

## Or (@RicardoSaporta)
as.data.table(data)[, list(.N, plus=plus[1]), by = val]

## Or (@DavidArenburg)
unique(as.data.table(data)[, N := .N, by = val], by = "val")

With "dplyr", you can try:

library(dplyr)

data %>%
  group_by(val) %>%
  mutate(N = n()) %>%
  slice(1)

In base R, I guess you can try something like:

do.call(rbind, lapply(split(data, data$val), 
                      function(x) cbind(x, N = nrow(x))[1, ]))
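
Another base R possibility, sketched here under the assumption that data is built exactly as in the question, is to compute the group size with ave() and then drop repeated val rows with duplicated(). The name res is only illustrative, and note that this adds an N column to data itself:

```r
# Rebuild the example data from the question
data <- data.frame(val = c(rep(6, 10), rep(7, 15), rep(8, 20), rep(9, 25),
                           rep(10, 100), rep(11, 20), rep(12, 15), rep(13, 10)))
data$plus <- data$val + 100

# ave() returns, for every row, the size of that row's val group
data$N <- ave(data$val, data$val, FUN = length)

# duplicated() flags repeated val values, so negating it keeps
# only the first row of each group
res <- data[!duplicated(data$val), ]
res
```

The row names of res are the positions of the first rows in data; rownames(res) <- NULL resets them if that matters.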
A5C1D2H2I1M1N2O1R2T1 answered Mar 16 '23 19:03


Edited

Or you can use aggregate()

data$N = 0
out = aggregate(N ~ val + plus, data = data, length)

or else

out = aggregate(plus ~ val, data = data, function(x) c(unique(x), N = length(x)))
do.call(data.frame, out)
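
With this second form, aggregate() stores both values in a single matrix column, which is why the do.call(data.frame, ...) step is needed to flatten it. A minimal sketch, assuming the data from the question and using x[1] in place of unique(x) (equivalent here because plus is constant within each val); the name res is only illustrative:

```r
# Rebuild the example data from the question
data <- data.frame(val = c(rep(6, 10), rep(7, 15), rep(8, 20), rep(9, 25),
                           rep(10, 100), rep(11, 20), rep(12, 15), rep(13, 10)))
data$plus <- data$val + 100

# FUN returns two values per group, so out$plus is a two-column matrix
out <- aggregate(plus ~ val, data = data,
                 FUN = function(x) c(plus = x[1], N = length(x)))

# Flatten the matrix column into ordinary columns, then tidy the names
res <- do.call(data.frame, out)
names(res) <- c("val", "plus", "N")
res
```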

or using ddply

library(plyr)
out = ddply(data, .(val,plus), summarize, N = length(plus))

#> out
#  val plus   N
#1   6  106  10
#2   7  107  15
#3   8  108  20
#4   9  109  25
#5  10  110 100
#6  11  111  20
#7  12  112  15
#8  13  113  10
Veerendra Gadekar answered Mar 16 '23 19:03