Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I use summarise_at to apply different functions to different columns?

Tags:

r

dplyr

I have a dataframe with the following columns:

> colnames(my.dataframe)
 [1] "id"                              "firstName"                       "lastName"                       
 [4] "position"                        "jerseyNumber"                    "currentTeamId"                  
 [7] "currentTeamAbbreviation"         "currentRosterStatus"             "height"                         
[10] "weight"                          "birthDate"                       "age"                            
[13] "birthCity"                       "birthCountry"                    "rookie"                         
[16] "handednessShoots"                "college"                         "twitter"                        
[19] "currentInjuryDescription"        "currentInjuryPlayingProbability" "teamId"                         
[22] "teamAbbreviation"                "fg2PtAtt"                        "fg3PtAtt"                       
[25] "fg2PtMade"                       "fg3PtMade"                       "ftMade"                         
[28] "fg2PtPct"                        "fg3PtPct"                        "ftPct"                          
[31] "ast"                             "tov"                             "offReb"                         
[34] "foulsDrawn"                      "blkAgainst"                      "plusMinus"                      
[37] "minSeconds"

And here is my code that isn't working:

my.dataframe %>%
  dplyr::group_by(id) %>%
  dplyr::summarise_at(vars(firstName:currentInjuryPlayingProbability), funs(min), na.rm = TRUE) %>%
  dplyr::summarise_at(vars(fg2PtAtt:minSeconds), funs(sum), na.rm = TRUE) %>% 
                    vars(), funs(min), na.rm = TRUE) %>%
  dplyr::summarise(teamId = paste(teamId), teamAbbreviation = paste(teamAbbreviation)) 

First I group by id (which is not a unique column in my dataframe, despite it being called id). For the next 19 columns up until currentInjuryPlayingProbability, these columns are always the same when grouped_by the ID, and so I use the min function to summarise / grab the value.

Next, I want to summarise all columns from fg2PtAtt to the end with the mean value (these columns are all numeric / integer).

Lastly, for the columns teamId and teamAbbreviation (which are not the same when grouped_by id), I want to paste them into a single string each with summarise.

My approach doesn't work because I don't think I can call summarise_at, followed by another summarise_at, followed by a summarise. By the time the second summarise_at is called, the columns trying to be summarised were already removed by the first summarise_at

Any help with this is appreciated!I will update with a subset of my dataframe shortly that code can be tested on.

EDIT:

dput(my.dataframe)
structure(list(id = c(10138L, 9466L, 9360L, 9360L), firstName = c("Alex", 
"Quincy", "Luke", "Luke"), lastName = c("Abrines", "Acy", "Babbitt", 
"Babbitt"), currentInjuryPlayingProbability = c(NA_character_, 
NA_character_, NA_character_, NA_character_), teamId = c(96L, 
84L, 91L, 92L), teamAbbreviation = c("OKL", "BRO", "ATL", "MIA"
), fg2PtAtt = c(70L, 73L, 57L, 2L), fg3PtAtt = c(221L, 292L, 
111L, 45L), minSeconds = c(67637L, 81555L, 34210L, 8676L)), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"))

my.dataframe
     id firstName lastName currentInjuryPlayingProbability teamId teamAbbreviation fg2PtAtt fg3PtAtt minSeconds
  <int> <chr>     <chr>    <chr>                            <int> <chr>               <int>    <int>      <int>
1 10138 Alex      Abrines  <NA>                                96 OKL                    70      221      67637
2  9466 Quincy    Acy      <NA>                                84 BRO                    73      292      81555
3  9360 Luke      Babbitt  <NA>                                91 ATL                    57      111      34210
4  9360 Luke      Babbitt  <NA>                                92 MIA                     2       45       8676

here is a shorted example with only 9 columns, but with enough data to highlight the problems. The resulting dataframe should look like this:

    id firstName lastName currentInjuryPlayingProbability teamId teamAbbreviation fg2PtAtt fg3PtAtt minSeconds
  <int> <chr>     <chr>    <chr>                            <chr>     <chr>               <int>    <int>      <int>
1 10138 Alex      Abrines  <NA>                                96      OKL                    70      221      67637
2  9466 Quincy    Acy      <NA>                                84      BRO                    73      292      81555
3  9360 Luke      Babbitt  <NA>                            91, 92 ATL, MIA                     57      156      42886
like image 752
Canovice Avatar asked Jul 31 '18 17:07

Canovice


People also ask

How do I apply a function across a column in R?

apply() lets you perform a function across a data frame's rows or columns. In the arguments, you specify what you want as follows: apply(X = data. frame, MARGIN = 1, FUN = function. you.

How does Group_by work in R?

The group_by() function in R is from dplyr package that is used to group rows by column values in the DataFrame, It is similar to GROUP BY clause in SQL. R dplyr groupby is used to collect identical data into groups on DataFrame and perform aggregate functions on the grouped data.


1 Answers

This is what I think is the simplest way for this particular task, at least compared to some similar map2/reduce solutions I've seen.

First point is that if you are using min to grab a value because you think it should be the same for every value of your grouping variable, just add it to the grouping. Then it is automatically preserved.

Second is that you can use {} to override the automatic placement of the LHS of %>% into the first argument of the RHS. This lets you in a single step apply different transformations and recombine them. Usually you don't need this because the placeholder . will do it for you, but if the placeholder is not a naked argument to the RHS you sometimes need it. (I am sure I read some resource that describes the exact rules but I can't find it right now).

Third is that because you know that summarise will drop columns you didn't select except the grouping variables, left_join will automatically use the shared column names to join on.

This means that we can do the following, which I think is pretty clean. If the transformations start to get particularly complex though (like if there are pipes inside the left_join I would recommend giving each piece of the final output its own assignment and name, to be clearer. You also need to be careful if you want more than one summary of the same column (like both mean and standard deviation), because as written the names will collide.

library(tidyverse)

my_dataframe <- structure(list(id = c(10138L, 9466L, 9360L, 9360L), firstName = c("Alex", "Quincy", "Luke", "Luke"), lastName = c("Abrines", "Acy", "Babbitt", "Babbitt"), currentInjuryPlayingProbability = c(NA_character_, NA_character_, NA_character_, NA_character_), teamId = c(96L, 84L, 91L, 92L), teamAbbreviation = c("OKL", "BRO", "ATL", "MIA"), fg2PtAtt = c(70L, 73L, 57L, 2L), fg3PtAtt = c(221L, 292L, 111L, 45L), minSeconds = c(67637L, 81555L, 34210L, 8676L)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

my_dataframe %>%
  group_by_at(.vars = vars(id:lastName)) %>%
  {left_join(
    summarise_at(., vars(teamId:teamAbbreviation), ~ str_c(., collapse = ",")),
    summarise_at(., vars(fg2PtAtt:minSeconds), mean)
  )}
#> Joining, by = c("id", "firstName", "lastName")
#> # A tibble: 3 x 8
#> # Groups:   id, firstName [?]
#>      id firstName lastName teamId teamAbbreviation fg2PtAtt fg3PtAtt
#>   <int> <chr>     <chr>    <chr>  <chr>               <dbl>    <dbl>
#> 1  9360 Luke      Babbitt  91,92  ATL,MIA              29.5       78
#> 2  9466 Quincy    Acy      84     BRO                  73        292
#> 3 10138 Alex      Abrines  96     OKL                  70        221
#> # ... with 1 more variable: minSeconds <dbl>

Created on 2018-07-31 by the reprex package (v0.2.0).

like image 87
Calum You Avatar answered Sep 24 '22 03:09

Calum You