I have a situation where my data frame contains the results of image analysis where the columns are the proportion of a particular class present in the image, such that an example dataframe class_df
would look like:
id A B C D E F
1 0.20 0.30 0.10 0.15 0.25 0.00
2 0.05 0.10 0.05 0.30 0.10 0.40
3 0.10 0.10 0.10 0.20 0.20 0.30
Each of these classes belongs to a functional group and I want to create new columns where the proportions of each functional group are calculated from the classes. An example mapping class_fg
class fg
A Z
B Z
C Z
D Y
E Y
F X
and the desired result would be (line added to show the desired new columns):
id A B C D E F | X Y Z
1 0.20 0.30 0.10 0.15 0.25 0.00 | 0.00 0.40 0.60
2 0.05 0.10 0.05 0.30 0.10 0.40 | 0.40 0.40 0.20
3 0.10 0.10 0.10 0.20 0.20 0.30 | 0.30 0.40 0.30
And I can do it one functional group at a time using
first_fg <- class_fg %>%
filter(fg == "Z") %>%
select(class) %>%
unlist()
class_df <- class_df %>%
mutate(Z = rowSums(select(., one_of(first_fg))))
Surely there is a better way to do this where I can calculate the row sum for each functional group without having to just repeat this code for each group? Maybe using purrr?
Use DataFrame. groupby(). sum() to group rows based on one or multiple columns and calculate sum agg function. groupby() function returns a DataFrameGroupBy object which contains an aggregate function sum() to calculate a sum of a given column for each group.
If you need to sum a column or row of numbers, let Excel do the math for you. Select a cell next to the numbers you want to sum, click AutoSum on the Home tab, press Enter, and you're done. When you click AutoSum, Excel automatically enters a formula (that uses the SUM function) to sum the numbers.
You can sum values by group with one formula easily in Excel. Select next cell to the data range, type this =IF(A2=A1,"",SUMIF(A:A,A2,B:B)), (A2 is the relative cell you want to sum based on, A1 is the column header, A:A is the column you want to sum based on, the B:B is the column you want to sum the values.)
In this article, we will learn how to Calculate Sum by groups in Rows and Columns in Excel. For Instance, you have a large list of data and you need to find the SUM of the numbers having a set of n values . For this article we will be needing the use the following functions: = SUM ( number1 , [number2], ...)
To sum multiple columns with two criteria, the formula is: =SUMPRODUCT((C2:E10) * (A2:A10=H1) * (B2:B10=H2)) These are the 3 ways to sum multiple columns based on one or more conditions in Excel.
SUMIF multiple columns The idea is to write a separate SUMIF formula for each of the columns you want to sum, and then add up the results: SUM(SUMIF(…), SUMIF(…), SUMIF(…))
Sumif with multiple criteria in one column In Excel, SUMIF function is a useful function for us to sum cells with multiple criteria in different columns, but with this function, we can also sum cells based on multiple criteria in one column.
Yet another option, along with the already contributed working solutions,
would be to use quasiquotation
with the rlang
package to build expressions to calculate the sums in each
group.
library(tidyverse)
First, define a helper function for doing an elementwise sum of vectors:
psum <- function(...) reduce(list(...), `+`)
Extracting the groupings into a list from class_fg
we can then construct
a list of expressions to calculate the sum in each group:
sum_exprs <- with(class_fg, split(class, fg)) %>%
map(~ rlang::expr(psum(!!!rlang::syms(.x))))
sum_exprs
#> $X
#> psum(F)
#>
#> $Y
#> psum(D, E)
#>
#> $Z
#> psum(A, B, C)
With the list of expressions ready, we can just "bang-bang-bang" (!!!
) them into the data with mutate
:
class_df %>%
mutate(!!!sum_exprs)
#> id A B C D E F X Y Z
#> 1 1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#> 2 2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#> 3 3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3
(I used the code provided by @MKR in his answer to create the data).
Created on 2018-05-22 by the reprex package (v0.2.0).
We could split
the 'class_df' by 'class', loop through the list
elements with map
, select
the columns of 'class_df' and get the rowSums
library(tidyverse)
class_fg %>%
split(.$fg) %>%
map_df(~ class_df %>%
select(one_of(.x$class)) %>%
rowSums) %>%
bind_cols(class_df, .)
# id A B C D E F X Y Z
#1 1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#2 2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#3 3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3
Or do a group by nest
ing, and then do the rowSums
by map
ping over the list
elements
class_fg %>%
group_by(fg) %>%
nest %>%
mutate(out = map(data, ~ class_df %>%
select(one_of(.x$class)) %>%
rowSums)) %>%
select(-data) %>%
unnest %>%
unstack(., out ~ fg) %>%
bind_cols(class_df, .)
Always it is easier to work on data in long format. Hence, change class_df
to long format using tidyr:gather
and join with class_fg
. Perform analysis in long format on your data. Finally, spread in wide-format to match expected result.
library(tidyverse)
class_df %>% gather(key, value, -id) %>%
inner_join(class_fg, by=c("key" = "class")) %>%
group_by(id, fg) %>%
summarise(value = sum(value)) %>%
spread(fg, value) %>%
inner_join(class_df, by="id") %>% as.data.frame()
# id X Y Z A B C D E F
# 1 1 0.0 0.4 0.6 0.20 0.3 0.10 0.15 0.25 0.0
# 2 2 0.4 0.4 0.2 0.05 0.1 0.05 0.30 0.10 0.4
# 3 3 0.3 0.4 0.3 0.10 0.1 0.10 0.20 0.20 0.3
Data:
class_fg <- read.table(text =
"class fg
A Z
B Z
C Z
D Y
E Y
F X",
header = TRUE, stringsAsFactors = FALSE)
class_df <- read.table(text =
"id A B C D E F
1 0.20 0.30 0.10 0.15 0.25 0.00
2 0.05 0.10 0.05 0.30 0.10 0.40
3 0.10 0.10 0.10 0.20 0.20 0.30",
header = TRUE, stringsAsFactors = FALSE)
My usual approach is to stick to base
R as long as the data sets don't get too large. In your case, a base
R solution would be:
class_df=as.data.frame(
c(class_df,
lapply(split(class_fg,class_fg$fg),
function(x) rowSums(class_df[,x$class,drop=FALSE]))))
class_df
# id A B C D E F X Y Z
#1 1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#2 2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#3 3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3
If the data sets get too large, I use data.table
. A data.table
solution for your problem:
library(data.table)
class_dt=data.table(class_df)
grps=split(class_fg,class_fg$fg)
for (g in grps) class_dt[,c(g$fg[1]):=rowSums(.SD),.SDcols=g$class,]
class_dt
# id A B C D E F X Y Z
#1: 1 0.20 0.3 0.10 0.15 0.25 0.0 0.0 0.4 0.6
#2: 2 0.05 0.1 0.05 0.30 0.10 0.4 0.4 0.4 0.2
#3: 3 0.10 0.1 0.10 0.20 0.20 0.3 0.3 0.4 0.3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With