I have a dataframe with the following columns:
> colnames(my.dataframe)
[1] "id" "firstName" "lastName"
[4] "position" "jerseyNumber" "currentTeamId"
[7] "currentTeamAbbreviation" "currentRosterStatus" "height"
[10] "weight" "birthDate" "age"
[13] "birthCity" "birthCountry" "rookie"
[16] "handednessShoots" "college" "twitter"
[19] "currentInjuryDescription" "currentInjuryPlayingProbability" "teamId"
[22] "teamAbbreviation" "fg2PtAtt" "fg3PtAtt"
[25] "fg2PtMade" "fg3PtMade" "ftMade"
[28] "fg2PtPct" "fg3PtPct" "ftPct"
[31] "ast" "tov" "offReb"
[34] "foulsDrawn" "blkAgainst" "plusMinus"
[37] "minSeconds"
And here is my code that isn't working:
my.dataframe %>%
dplyr::group_by(id) %>%
dplyr::summarise_at(vars(firstName:currentInjuryPlayingProbability), funs(min), na.rm = TRUE) %>%
dplyr::summarise_at(vars(fg2PtAtt:minSeconds), funs(sum), na.rm = TRUE) %>%
vars(), funs(min), na.rm = TRUE) %>%
dplyr::summarise(teamId = paste(teamId), teamAbbreviation = paste(teamAbbreviation))
First I group by id (which is not a unique column in my dataframe, despite it being called id). For the next 19 columns up until currentInjuryPlayingProbability, these columns are always the same when grouped_by the ID, and so I use the min
function to summarise / grab the value.
Next, I want to summarise all columns from fg2PtAtt
to the end with the mean value (these columns are all numeric / integer).
Lastly, for the columns teamId and teamAbbreviation (which are not the same when grouped_by id), I want to paste them into a single string each with summarise.
My approach doesn't work because I don't think I can call summarise_at, followed by another summarise_at, followed by a summarise. By the time the second summarise_at is called, the columns trying to be summarised were already removed by the first summarise_at
Any help with this is appreciated!I will update with a subset of my dataframe shortly that code can be tested on.
EDIT:
dput(my.dataframe)
structure(list(id = c(10138L, 9466L, 9360L, 9360L), firstName = c("Alex",
"Quincy", "Luke", "Luke"), lastName = c("Abrines", "Acy", "Babbitt",
"Babbitt"), currentInjuryPlayingProbability = c(NA_character_,
NA_character_, NA_character_, NA_character_), teamId = c(96L,
84L, 91L, 92L), teamAbbreviation = c("OKL", "BRO", "ATL", "MIA"
), fg2PtAtt = c(70L, 73L, 57L, 2L), fg3PtAtt = c(221L, 292L,
111L, 45L), minSeconds = c(67637L, 81555L, 34210L, 8676L)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
my.dataframe
id firstName lastName currentInjuryPlayingProbability teamId teamAbbreviation fg2PtAtt fg3PtAtt minSeconds
<int> <chr> <chr> <chr> <int> <chr> <int> <int> <int>
1 10138 Alex Abrines <NA> 96 OKL 70 221 67637
2 9466 Quincy Acy <NA> 84 BRO 73 292 81555
3 9360 Luke Babbitt <NA> 91 ATL 57 111 34210
4 9360 Luke Babbitt <NA> 92 MIA 2 45 8676
here is a shorted example with only 9 columns, but with enough data to highlight the problems. The resulting dataframe should look like this:
id firstName lastName currentInjuryPlayingProbability teamId teamAbbreviation fg2PtAtt fg3PtAtt minSeconds
<int> <chr> <chr> <chr> <chr> <chr> <int> <int> <int>
1 10138 Alex Abrines <NA> 96 OKL 70 221 67637
2 9466 Quincy Acy <NA> 84 BRO 73 292 81555
3 9360 Luke Babbitt <NA> 91, 92 ATL, MIA 57 156 42886
apply() lets you perform a function across a data frame's rows or columns. In the arguments, you specify what you want as follows: apply(X = data. frame, MARGIN = 1, FUN = function. you.
The group_by() function in R is from dplyr package that is used to group rows by column values in the DataFrame, It is similar to GROUP BY clause in SQL. R dplyr groupby is used to collect identical data into groups on DataFrame and perform aggregate functions on the grouped data.
This is what I think is the simplest way for this particular task, at least compared to some similar map2
/reduce
solutions I've seen.
First point is that if you are using min
to grab a value because you think it should be the same for every value of your grouping variable, just add it to the grouping. Then it is automatically preserved.
Second is that you can use {}
to override the automatic placement of the LHS of %>%
into the first argument of the RHS. This lets you in a single step apply different transformations and recombine them. Usually you don't need this because the placeholder .
will do it for you, but if the placeholder is not a naked argument to the RHS you sometimes need it. (I am sure I read some resource that describes the exact rules but I can't find it right now).
Third is that because you know that summarise
will drop columns you didn't select except the grouping variables, left_join
will automatically use the shared column names to join on.
This means that we can do the following, which I think is pretty clean. If the transformations start to get particularly complex though (like if there are pipes inside the left_join
I would recommend giving each piece of the final output its own assignment and name, to be clearer. You also need to be careful if you want more than one summary of the same column (like both mean and standard deviation), because as written the names will collide.
library(tidyverse)
my_dataframe <- structure(list(id = c(10138L, 9466L, 9360L, 9360L), firstName = c("Alex", "Quincy", "Luke", "Luke"), lastName = c("Abrines", "Acy", "Babbitt", "Babbitt"), currentInjuryPlayingProbability = c(NA_character_, NA_character_, NA_character_, NA_character_), teamId = c(96L, 84L, 91L, 92L), teamAbbreviation = c("OKL", "BRO", "ATL", "MIA"), fg2PtAtt = c(70L, 73L, 57L, 2L), fg3PtAtt = c(221L, 292L, 111L, 45L), minSeconds = c(67637L, 81555L, 34210L, 8676L)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
my_dataframe %>%
group_by_at(.vars = vars(id:lastName)) %>%
{left_join(
summarise_at(., vars(teamId:teamAbbreviation), ~ str_c(., collapse = ",")),
summarise_at(., vars(fg2PtAtt:minSeconds), mean)
)}
#> Joining, by = c("id", "firstName", "lastName")
#> # A tibble: 3 x 8
#> # Groups: id, firstName [?]
#> id firstName lastName teamId teamAbbreviation fg2PtAtt fg3PtAtt
#> <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 9360 Luke Babbitt 91,92 ATL,MIA 29.5 78
#> 2 9466 Quincy Acy 84 BRO 73 292
#> 3 10138 Alex Abrines 96 OKL 70 221
#> # ... with 1 more variable: minSeconds <dbl>
Created on 2018-07-31 by the reprex package (v0.2.0).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With