
R - Group By Multiple Columns

Tags:

r

I'm trying to run analysis on a dataset that categorizes companies into 20 different industries and some 800 categories. Each industry category is in its own column. Here's a sample data frame:

df <- data.frame(biz.name=c("goog", "face", "eb"), worth=c(100, 200, 300),
cat1=c("social", "social", "social"), cat2=c(NA, "search", "finance"),
cat3=c(NA, NA, "commerce"))

I'd like to know how to run analysis on different types of categories. For instance, how would I get the average worth of different categories, such as "social" or "finance"? Each company can be in up to 20 categories (non-repeating per row).

The dplyr package's group_by is my normal go-to method, but chaining doesn't seem to work for multiple columns:

cat.test <- df %>% 
  group_by(cat1:cat2) %>%
  summarise (avg = mean(is.na(worth)))

The code produces a measure for each combination of categories that a business falls into, rather than for each category individually. In the sample data frame, the category "social" should have a total net worth of 600 and a mean of 200.

I've looked at multiple tutorials, but haven't found one that shows how to group_by across multiple columns. Thanks, and let me know if I can make this question any clearer.

[UPDATE: edited data.frame code]

tom asked Dec 25 '22 14:12


2 Answers

I cleaned up your code and was able to get a result out using the data.table package:

df <- data.frame(biz.name=c("goog", "face", "eb"), worth=c(100, 200, 300), 
                 cat1=c("social", "social", "social"), cat2=c("NA", "search", "finance"),
                 cat3=c("NA", "NA", "commerce"))

library(data.table)
dt <- data.table(df)
dt[, Mean:=mean(worth), by=list(cat1, cat2)]

> dt
     biz.name  worth   cat1    cat2     cat3 Mean
1:       goog    100 social      NA       NA  100
2:       face    200 social  search       NA  200
3:         eb    300 social finance commerce  300
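
Note that by=list(cat1, cat2) still forms one group per combination of cat1 and cat2, which is why every row above ends up as its own group. As a rough sketch of a per-category summary in base R only, the category columns can be stacked into a single vector first. The names cat.cols, long and result are illustrative, and the "^cat" pattern assumes all category columns share that prefix:

```r
# Sample data from the question, with real NA values (not "NA" strings)
df <- data.frame(biz.name = c("goog", "face", "eb"), worth = c(100, 200, 300),
                 cat1 = c("social", "social", "social"),
                 cat2 = c(NA, "search", "finance"),
                 cat3 = c(NA, NA, "commerce"))

# Category columns, assumed to share the "cat" prefix
cat.cols <- grep("^cat", names(df), value = TRUE)

# Stack the category columns column by column and repeat worth to match
long <- data.frame(
  worth    = rep(df$worth, times = length(cat.cols)),
  category = unlist(lapply(df[cat.cols], as.character), use.names = FALSE)
)

# Drop empty category slots before summarising
long <- long[!is.na(long$category), ]

# One total and one mean per category
total  <- tapply(long$worth, long$category, sum)
avg    <- tapply(long$worth, long$category, mean)
result <- data.frame(category = names(total),
                     total = as.vector(total),
                     avg = as.vector(avg))
print(result)
```

For the sample data this gives "social" a total of 600 and a mean of 200, since all three companies carry that label.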
Tim Biegeleisen answered Jan 31 '23 05:01


I would use data.table this way:

library(data.table)
melt(setDT(df[-1]), id.vars='worth', value.name='category')[,.(worth=sum(worth)),category]
#   category worth
#1:   social   600
#2:       NA   400
#3:   search   200
#4:  finance   300
#5: commerce   300
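
Since the question asks for the average as well as the total, the same melt idea can be extended. This is only a sketch: it assumes the category columns all start with "cat", and the names long and res are made up for illustration. Passing na.rm=TRUE to melt drops the NA category rows, so they do not form a group of their own:

```r
library(data.table)

# Sample data from the question
df <- data.frame(biz.name = c("goog", "face", "eb"), worth = c(100, 200, 300),
                 cat1 = c("social", "social", "social"),
                 cat2 = c(NA, "search", "finance"),
                 cat3 = c(NA, NA, "commerce"))

# Long format: one row per (company, category) pair, NA categories removed
long <- melt(as.data.table(df),
             id.vars = c("biz.name", "worth"),
             measure.vars = grep("^cat", names(df), value = TRUE),
             value.name = "category",
             na.rm = TRUE)

# Total, mean and number of companies per category
res <- long[, .(total = sum(worth), avg = mean(worth), n = .N), by = category]
print(res)
```

Here "social" should come out with a total of 600, a mean of 200 and n of 3. Keeping biz.name as an id column also makes it easy to check which companies fed each group.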
Colonel Beauvel answered Jan 31 '23 05:01