I have a data set that has the following information:
Subject Value1 Value2 Value3 UniqueNumber
001 1 0 1 3
002 0 1 1 2
003 1 1 1 1
If the value of UniqueNumber > 0, I would like to use dplyr to sum, for each subject, the values in columns Value1 through the column indicated by UniqueNumber, and to calculate the corresponding mean. So for Subject 001, sum = 2 and mean = 0.67.
# My attempt with a base R loop (Value1 is column 2, so the columns to use
# for row i are 2:(1 + UniqueNumber[i])):
total <- numeric(nrow(Data))
average <- numeric(nrow(Data))
for (i in 1:nrow(Data)) {
  if (Data$UniqueNumber[i] > 0) {
    vals <- as.numeric(Data[i, 2:(1 + Data$UniqueNumber[i])])
    total[i] <- sum(vals)
    average[i] <- mean(vals)
  }
}
Edit: I am only looking to sum through the number of columns listed in the 'UniqueNumber' column, i.e. loop through every row and stop at the column indicated by 'UniqueNumber'. Example: Row 2 with Subject 002 should sum the values in columns 'Value1' and 'Value2', while Row 3 with Subject 003 should only sum the value in column 'Value1'.
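For reference, a minimal sketch of the intended row-wise calculation with dplyr's rowwise() and c_across(), assuming the data is in a data frame called Data as shown above and that dplyr >= 1.0.0 is installed (these functions do not exist in older versions):
library(dplyr)
Data %>%
  filter(UniqueNumber > 0) %>%   # only rows with a positive UniqueNumber
  rowwise() %>%                  # work one row at a time
  mutate(
    Total = sum(c_across(Value1:Value3)[1:UniqueNumber]),   # first UniqueNumber values
    Mean  = mean(c_across(Value1:Value3)[1:UniqueNumber])
  ) %>%
  ungroup()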
Not a tidyverse fan/expert, but I would try this using long format: just filter by row index per group and then run whatever functions you want on a single column (much easier this way).
library(tidyr)
library(dplyr)
Data %>%
gather(variable, value, -Subject, -UniqueNumber) %>% # long format
group_by(Subject) %>% # group by Subject in order to get row counts
filter(row_number() <= UniqueNumber) %>% # filter by row index
summarise(Mean = mean(value), Total = sum(value)) %>% # do the calculations
ungroup()
## A tibble: 3 x 3
# Subject Mean Total
# <int> <dbl> <int>
# 1 1 0.667 2
# 2 2 0.5 1
# 3 3 1 1
A very similar way to achieve this is to filter by the integers in the column names. The filter step comes before the group_by, so it could potentially improve performance (or not?), but it is less robust because it assumes that the columns of interest are named "Value#" (a name-independent variant is sketched after the output below).
Data %>%
gather(variable, value, -Subject, -UniqueNumber) %>% #long format
filter(as.numeric(gsub("Value", "", variable, fixed = TRUE)) <= UniqueNumber) %>% #filter
group_by(Subject) %>% # group by Subject
summarise(Mean = mean(value), Total = sum(value)) %>% # do the calculations
ungroup()
## A tibble: 3 x 3
# Subject Mean Total
# <int> <dbl> <int>
# 1 1 0.667 2
# 2 2 0.5 1
# 3 3 1 1
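A name-independent sketch of the same idea, assuming the value columns are every column other than Subject and UniqueNumber and that they appear in Value1..ValueN order:
value_cols <- setdiff(names(Data), c("Subject", "UniqueNumber"))  # assumed definition of the value columns
Data %>%
  gather(variable, value, -Subject, -UniqueNumber) %>%
  filter(match(variable, value_cols) <= UniqueNumber) %>%  # position in the original column order
  group_by(Subject) %>%
  summarise(Mean = mean(value), Total = sum(value)) %>%
  ungroup()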
Just for fun, adding a data.table solution
library(data.table)
data.table(Data) %>%
melt(id = c("Subject", "UniqueNumber")) %>%
.[as.numeric(gsub("Value", "", variable, fixed = TRUE)) <= UniqueNumber,
.(Mean = round(mean(value), 3), Total = sum(value)),
by = Subject]
# Subject Mean Total
# 1: 1 0.667 2
# 2: 2 0.500 1
# 3: 3 1.000 1
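The same query in plain data.table syntax, without the magrittr pipe (as.data.table() is used here purely as an alternative way to convert the data frame):
melt(as.data.table(Data), id.vars = c("Subject", "UniqueNumber"))[
  as.numeric(gsub("Value", "", variable, fixed = TRUE)) <= UniqueNumber,
  .(Mean = round(mean(value), 3), Total = sum(value)),
  by = Subject]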
Here is another method that uses tidyr::nest to collect the Value columns into a list-column so that we can iterate through the table with map2. In each row, we select the correct values from the Values list-column and take the sum or mean, respectively.
library(tidyverse)
tbl <- read_table2(
"Subject Value1 Value2 Value3 UniqueNumber
001 1 0 1 3
002 0 1 1 2
003 1 1 1 1"
)
tbl %>%
filter(UniqueNumber > 0) %>%
nest(starts_with("Value"), .key = "Values") %>%
mutate(
sum = map2_dbl(UniqueNumber, Values, ~ sum(.y[1:.x], na.rm = TRUE)),
mean = map2_dbl(UniqueNumber, Values, ~ mean(as.numeric(.y[1:.x]), na.rm = TRUE))
)
#> # A tibble: 3 x 5
#> Subject UniqueNumber Values sum mean
#> <chr> <dbl> <list> <dbl> <dbl>
#> 1 001 3 <tibble [1 × 3]> 2 0.667
#> 2 002 2 <tibble [1 × 3]> 1 0.5
#> 3 003 1 <tibble [1 × 3]> 1 1
Created on 2019-02-14 by the reprex package (v0.2.1)
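Note that nest() changed its interface in tidyr 1.0.0. With a newer tidyr installed (an assumption about the package version, not something shown in the reprex above), the nesting step would be written with the new-column-name-first form, and the rest of the pipeline stays the same:
tbl %>%
  filter(UniqueNumber > 0) %>%
  nest(Values = starts_with("Value"))  # tidyr >= 1.0.0 equivalent of the nest() call above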
The OP might be interested only in a dplyr solution, but for comparison purposes and for future readers, here is a base R option using mapply:
cols <- grep("^Value", names(df))
cbind(df, t(mapply(function(x, y) {
if (y > 0) {
vals = as.numeric(df[x, cols[1:y]])
c(Sum = sum(vals, na.rm = TRUE), Mean = mean(vals, na.rm = TRUE))
} else {
c(0, 0)
}
}, 1:nrow(df), df$UniqueNumber)))
# Subject Value1 Value2 Value3 UniqueNumber Sum Mean
#1 1 1 0 1 3 2 0.667
#2 2 0 1 1 2 1 0.500
#3 3 1 1 1 1 1 1.000
Here we subset each row based on its respective UniqueNumber and then calculate its sum and mean if the UniqueNumber value is greater than 0, or else return only 0.
Check this solution:
df %>%
gather(key, val, Value1:Value3) %>%
group_by(Subject) %>%
mutate(
Sum = sum(val[1:UniqueNumber[1]]),
Mean = mean(val[1:UniqueNumber[1]])
) %>%
spread(key, val)
Output:
Subject UniqueNumber Sum Mean Value1 Value2 Value3
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001 3 2 0.667 1 0 1
2 002 2 1 0.5 0 1 1
3 003 1 1 1 1 1 1
A solution that uses purrr::map_df (which is from the same author as dplyr).
library(dplyr)
library(purrr)
l_dat <- split(dat, dat$Subject) # first we need to split in a list
map_df(l_dat, function(x) {
n_cols <- x$UniqueNumber # number of Value columns to use for this subject
x <- as.numeric(x[2:(n_cols+1)]) # subsets x and converts to numeric
mean(x, na.rm=T) # mean to be returned
})
# output:
# # A tibble: 1 x 3
# `1` `2` `3`
# <dbl> <dbl> <dbl>
# 1 0.667 0.5 1
Another option (output format closer to a dplyr solution):
map_df(l_dat, function(x) {
n_cols <- x$UniqueNumber
id <- x$Subject
x <- as.numeric(x[2:(n_cols+1)])
tibble(id=id, mean_values=mean(x, na.rm=T))
})
# # A tibble: 3 x 2
# id mean_values
# <int> <dbl>
# 1 1 0.667
# 2 2 0.5
# 3 3 1
Just as an example, I added a sum() and then divided by length(x) - 1:
map_df(l_dat, function(x) {
n_cols <- x$UniqueNumber
id <- x$Subject
x <- as.numeric(x[2:(n_cols+1)])
tibble(id=id,
mean_values=sum(x, na.rm=T)/(length(x)-1)) # change here
})
# # A tibble: 3 x 2
# id mean_values
# <int> <dbl>
# 1 1 1.
# 2 2 1.
# 3 3 Inf #beware of this case where you end up dividing by 0
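One way to guard against that division by zero is sketched below; returning NA_real_ instead of Inf is an assumption about the desired behaviour rather than something the question specifies:
map_df(l_dat, function(x) {
  n_cols <- x$UniqueNumber
  id <- x$Subject
  x <- as.numeric(x[2:(n_cols + 1)])
  denom <- length(x) - 1  # 0 when only one Value column is used
  tibble(id = id,
         mean_values = if (denom > 0) sum(x, na.rm = TRUE) / denom else NA_real_)  # NA_real_ is an assumed fallback
})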
Data:
tt <- "Subject Value1 Value2 Value3 UniqueNumber
001 1 0 1 3
002 0 1 1 2
003 1 1 1 1"
dat <- read.table(text=tt, header=T)
I think the easiest way is to set to NA the values beyond each row's UniqueNumber (the ones that should not be counted), then use rowSums and rowMeans on the appropriate subset of columns.
Data[2:4][col(Data[2:4]) > Data[[5]]] <- NA
Data
# Subject Value1 Value2 Value3 UniqueNumber
# 1 1 1 0 1 3
# 2 2 0 1 NA 2
# 3 3 1 NA NA 1
library(dplyr)
Data%>%
mutate(sum = rowSums(.[2:4], na.rm = TRUE),
mean = rowMeans(.[2:4], na.rm = TRUE))
# Subject Value1 Value2 Value3 UniqueNumber sum mean
# 1 1 1 0 1 3 2 0.6666667
# 2 2 0 1 NA 2 1 0.5000000
# 3 3 1 NA NA 1 1 1.0000000
Or, to stay in base R: transform(Data, sum = rowSums(Data[2:4], na.rm = TRUE), mean = rowMeans(Data[2:4], na.rm = TRUE))
data
Data <- structure(
list(Subject = 1:3,
Value1 = c(1L, 0L, 1L),
Value2 = c(0L, 1L, 1L),
Value3 = c(1L, 1L, 1L),
UniqueNumber = c(3L, 2L, 1L)),
.Names = c("Subject", "Value1", "Value2", "Value3", "UniqueNumber"),
row.names = c(NA, 3L), class = "data.frame")