Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Summarizing data by name separated across multiple variables

I'm trying to count totals for goals, primary assists, and secondary assists for each player. My problem is that I can't get my head around the logic to do that, as the data I want to summarize by (player name) is listed across three variables (goal, primary assist and secondary assist)

Here's my reproducible data (it's from dput(), so apologies for the mess).

mydata <- structure(list(primary_assist = c("Dmitry Gilyazitdinov", "Evgeny Orlov", 
"Anton Burdasov", "Sergei Kalinin", "Stanislav Solovyov", "Vasily Streltsov", 
NA, "Bogdan Potekhin", "Bogdan Potekhin", "Vasily Streltsov", 
"Vasily Streltsov", "Viktor Postnikov", "Danil Kaskov", NA, NA, 
"Artemy Panarin"), secondary_assist = c("Andrei Badrutdinov", 
NA, NA, NA, "Danil Gubarev", "Nikita Manukhov", NA, "Evgeny Grigorenko", 
"Daniil Apalkov", "Ivan Boiko", NA, "Viktor Antipin", "Vitaly Sychov", 
NA, NA, "Stanislav Levin"), goal = c("Vitaly Kropachyov", "Dmitry Kozlov", 
"Stanislav Solovyov", "Kirill Polyansky", "Anton Burdasov", "Ilya Solodov", 
"Alexander Antropov", "Daniil Apalkov", "Evgeny Grigorenko", 
"Alexander Antropov", "Alexander Antropov", "Evgeny Grigorenko", 
"Denis Belonogov", "Vitaly Sychov", "Alexander Streltsov", "Pyotr Kopyttsov"
), team = c("Belye Medvedi", "Omskie Yastreby", "Belye Medvedi", 
"Omskie Yastreby", "Belye Medvedi", "Avto", "Avto", "Stalnye Lisy", 
"Stalnye Lisy", "Avto", "Avto", "Stalnye Lisy", "Avto", "Avto", 
"Avto", "Russkie Vityazi"), game_strength = c("PP", "EV", "EV", 
"EV", "EV", "PP", "SO", "EV", "PP", "PP", "EV", "PP", "PP", "EV", 
"PP", "EV"), season = c("2009-10", "2009-10", "2009-10", "2009-10", 
"2009-10", "2009-10", "2009-10", "2009-10", "2009-10", "2009-10", 
"2009-10", "2009-10", "2009-10", "2009-10", "2009-10", "2009-10"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-16L), .Names = c("primary_assist", "secondary_assist", "goal", 
"team", "game_strength", "season"))
mydata
#> # A tibble: 16 x 6
#>    primary_assist       secondary_assist  goal  team  game_strength season
#>    <chr>                <chr>             <chr> <chr> <chr>         <chr> 
#>  1 Dmitry Gilyazitdinov Andrei Badrutdin~ Vita~ Bely~ PP            2009-~
#>  2 Evgeny Orlov         <NA>              Dmit~ Omsk~ EV            2009-~
#>  3 Anton Burdasov       <NA>              Stan~ Bely~ EV            2009-~
#>  4 Sergei Kalinin       <NA>              Kiri~ Omsk~ EV            2009-~
#>  5 Stanislav Solovyov   Danil Gubarev     Anto~ Bely~ EV            2009-~
#>  6 Vasily Streltsov     Nikita Manukhov   Ilya~ Avto  PP            2009-~
#>  7 <NA>                 <NA>              Alex~ Avto  SO            2009-~
#>  8 Bogdan Potekhin      Evgeny Grigorenko Dani~ Stal~ EV            2009-~
#>  9 Bogdan Potekhin      Daniil Apalkov    Evge~ Stal~ PP            2009-~
#> 10 Vasily Streltsov     Ivan Boiko        Alex~ Avto  PP            2009-~
#> 11 Vasily Streltsov     <NA>              Alex~ Avto  EV            2009-~
#> 12 Viktor Postnikov     Viktor Antipin    Evge~ Stal~ PP            2009-~
#> 13 Danil Kaskov         Vitaly Sychov     Deni~ Avto  PP            2009-~
#> 14 <NA>                 <NA>              Vita~ Avto  EV            2009-~
#> 15 <NA>                 <NA>              Alex~ Avto  PP            2009-~
#> 16 Artemy Panarin       Stanislav Levin   Pyot~ Russ~ EV            2009-~

So, I want to count the number of goals, primary assists, and secondary assists for each player, and then have 1 row for each player. Let's say the name "Artemy Panarin" is listed 1 time in goals, 0 times in primary assists, and 2 times in secondary assists, my output would look like this:

tibble::tibble(name = c("Artemy Panarin", "Stanislav Levin", "Danil Kaskov"), team = c("Russkie Vityazi", "Russkie Vityazi", "Avto"), goals = c(1, 1, 0), primary_assists = c(0, 0, 1), secondary_assists = c(2, 0, 0))
#> # A tibble: 3 x 5
#>   name            team            goals primary_assists secondary_assists
#>   <chr>           <chr>           <dbl>           <dbl>             <dbl>
#> 1 Artemy Panarin  Russkie Vityazi  1.00            0                 2.00
#> 2 Stanislav Levin Russkie Vityazi  1.00            0                 0   
#> 3 Danil Kaskov    Avto             0               1.00              0

Does that make any sense? Any ideas? Tidyverse solutions preferred. Thanks!

like image 786
Evan O. Avatar asked Jul 18 '18 15:07

Evan O.


People also ask

What are the ways of summarizing data?

The three common ways of looking at the center are average (also called mean), mode and median. All three summarize a distribution of the data by describing the typical value of a variable (average), the most frequently repeated number (mode), or the number in the middle of all the other numbers in a data set (median).

Which do you think is the best method to use in summarizing and describing business data table or graph?

Graphical displays are very useful for summarizing data, and both dichotomous and non-ordered categorical variables are best summarized with bar charts.

Why is it necessary summarize data?

One reason that we summarize data is that it provides us with a way to generalize - that is, to make general statements that extend beyond specific observations.


2 Answers

We can gather to 'long' format, grouped by 'name', 'team', and the 'key' column (from gather), summarise to get the count and spread back to 'wide' format

library(tidyverse)
gather(mydata, key, name, primary_assist:goal) %>% 
     group_by(name, team, key) %>% 
     summarise(n = n()) %>% 
     spread(key, n, fill = 0)
# A tibble: 30 x 5
# Groups:   name, team [30]
#   name                team             goal primary_assist secondary_assist
#   <chr>               <chr>           <dbl>          <dbl>            <dbl>
# 1 Alexander Antropov  Avto                3              0                0
# 2 Alexander Streltsov Avto                1              0                0
# 3 Andrei Badrutdinov  Belye Medvedi       0              0                1
# 4 Anton Burdasov      Belye Medvedi       1              1                0
# 5 Artemy Panarin      Russkie Vityazi     0              1                0
# 6 Bogdan Potekhin     Stalnye Lisy        0              2                0
# 7 Daniil Apalkov      Stalnye Lisy        1              0                1
# 8 Danil Gubarev       Belye Medvedi       0              0                1
# 9 Danil Kaskov        Avto                0              1                0
#10 Denis Belonogov     Avto                1              0                0
# ... with 20 more rows
like image 138
akrun Avatar answered Sep 20 '22 15:09

akrun


One way to get the result is to reshape your data with gather()/spread() in addition to your summarize strategy.

library(tidyverse)

scoring_summary <- mydata %>%
    select(primary_assist:team) %>%
    gather("key", "player", -team) %>%
    group_by(player) %>%
    count(key) %>%
    spread(key, n)

# convert NAs to 0
scoring_summary[is.na(scoring_summary)] <- 0

scoring_summary

# A tibble: 28 x 4
# Groups:   player [28]
player               goal primary_assist secondary_assist
<chr>               <dbl>          <dbl>            <dbl>
1 Alexander Antropov      3              0                0
2 Alexander Streltsov     1              0                0
3 Andrei Badrutdinov      0              0                1
4 Anton Burdasov          1              1                0
5 Artemy Panarin          0              1                0
6 Bogdan Potekhin         0              2                0
7 Daniil Apalkov          1              0                1
8 Danil Gubarev           0              0                1
9 Danil Kaskov            0              1                0
10 Denis Belonogov        1              0                0

count() is doing the same thing as your original attempt with summarise(count(goals)

like image 44
Nate Avatar answered Sep 21 '22 15:09

Nate