I have a panel dataset at the country and year level, and I'd like to create two new variables based on existing ones.
year | country | var1 | var2 | var3 | var4 | mean_var1 | relmean_var1 |
---|---|---|---|---|---|---|---|
1910 | GER | 1 | 4 | 10 | 6 | 3 | 0.333 |
1911 | GER | 2 | 3 | 11 | 7 | 1.5 | 1.3333 |
1910 | FRA | 5 | 6 | 8 | 9 | 3 | 1.66667 |
1911 | FRA | 1 | 4 | 10 | 9 | 1.5 | .66667 |
What I'd like to do is create two new sets of variables: (1) the average for each year (across countries) and (2) the country value relative to that year average. For example, for var1, (1) would yield mean_var1 and (2) would yield relmean_var1, and I'd want these for all the other variables as well. In total, there are over 1000 variables in the dataset, but I would only apply this function to about 6.
I have code that works for the first part, but I'd like to combine it as efficiently as possible with the second.
library(dplyr)
library(purrr)
df<- df%>%
group_by(year) %>%
mutate_at(.funs = list(mean = ~mean(.)), .vars = c("var1", "var2", "var3", "var4"))
This code yields new variables called var1_mean (I would prefer mean_var1: how do I change this name?)
For the second step, I've tried:
df <- df %>%
map2_dfr(.x = d.test %>%
select(var1, var2),
.y = d.test %>%
select(var1_mean, var2_mean),
~ .x / .y) %>%
setNames(c("relmean_var1", "relmean_var2"))
and I get the error:
"Error in select(., var1, var2) : object 'd.test' not found."
(I got this setup from this question.)
I also tried:
map2(var1, var1_mean, ~ df[[.x]] / df[[.y]]) %>%
set_names(cols) %>%
bind_cols(df, .)
And got:
"Error in map2(var1, var1_mean, ~df[[.x]]/df[[.y]]) : object 'var1' not found"
What's the best way to combine these two goals? Ideally with the naming scheme mean_var1 for (1) and relmean_var1 for (2)
Edit: input dataframe should look like this:
data <- tibble::tribble(
~year, ~country, ~var1, ~var2, ~var3, ~var.4,
1910L, "GER", 1L, 4L, 10L, 6L,
1911L, "GER", 2L, 3L, 11L, 7L,
1910L, "FRA", 5L, 6L, 8L, 9L,
1911L, "FRA", 1L, 4L, 10L, 9L
)
The output dataframe should look like this (just showing var1 as an example, but it should be the same format for var2 through var4):
datanew <- tibble::tribble(
~year, ~country, ~var1, ~var2, ~var3, ~var.4, ~mean_var1, ~relmean_var1,
1910L, "GER", 1L, 4L, 10L, 6L, 3, 0.3333,
1911L, "GER", 2L, 3L, 11L, 7L, 1.5, 1.3333,
1910L, "FRA", 5L, 6L, 8L, 9L, 3, 1.6667,
1911L, "FRA", 1L, 4L, 10L, 9L, 1.5, 0.6667
)
This might be easier in long format, but here's an option you can pursue with wide data.
Using a recent version of dplyr (1.0.0 or later), you can call across inside mutate and include the .names argument to define how you want your new columns to be named.
library(tidyverse)
my_col <- c("var1", "var2", "var3", "var4")
df %>%
group_by(year) %>%
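# year mean of each selected column; released dplyr spells the glue placeholder {.col}
mutate(across(all_of(my_col), mean, .names = "mean_{.col}")) %>%
# divide the original columns by their year means; names come from the first across()
mutate(across(all_of(my_col), identity, .names = "relmean_{.col}") / across(all_of(paste0("mean_", my_col))))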
Output
year country var1 var2 var3 var4 mean_var1 mean_var2 mean_var3 mean_var4 relmean_var1 relmean_var2 relmean_var3 relmean_var4
<int> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1910 GER 1 4 10 6 3 5 9 7.5 0.333 0.8 1.11 0.8
2 1911 GER 2 3 11 7 1.5 3.5 10.5 8 1.33 0.857 1.05 0.875
3 1910 FRA 5 6 8 9 3 5 9 7.5 1.67 1.2 0.889 1.2
4 1911 FRA 1 4 10 9 1.5 3.5 10.5 8 0.667 1.14 0.952 1.12
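If you'd rather do both steps in a single mutate, one possible variant is to pass a named list of functions to across and build the column names from {.fn} and {.col}. This is only a sketch: df below is the example data from the question, with the last column assumed to be named var4 rather than var.4.

```r
library(dplyr)

# Example data from the question (last column assumed to be called var4)
df <- tibble::tribble(
  ~year, ~country, ~var1, ~var2, ~var3, ~var4,
  1910L, "GER", 1L, 4L, 10L, 6L,
  1911L, "GER", 2L, 3L, 11L, 7L,
  1910L, "FRA", 5L, 6L, 8L, 9L,
  1911L, "FRA", 1L, 4L, 10L, 9L
)

my_col <- c("var1", "var2", "var3", "var4")

res <- df %>%
  group_by(year) %>%
  mutate(across(
    all_of(my_col),
    # one function per new column; the list names become {.fn}
    list(mean = ~ mean(.x), relmean = ~ .x / mean(.x)),
    .names = "{.fn}_{.col}"
  )) %>%
  ungroup()

res
```

The new columns come out interleaved (mean_var1, relmean_var1, mean_var2, ...) rather than grouped by statistic, so add a relocate or select step if the order matters to you.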
library(tidyverse)
data <- tibble::tribble(
~year, ~country, ~var1, ~var2, ~var3, ~var.4,
1910L, "GER", 1L, 2L, 10L, 6L,
1911L, "GER", 2L, 3L, 11L, 7L,
1910L, "FRA", 5L, 6L, 8L, 9L,
1911L, "FRA", 1L, 3L, 10L, 9L
)
data_long <-
data %>%
pivot_longer(-c(year, country))
data_long
#> # A tibble: 16 x 4
#> year country name value
#> <int> <chr> <chr> <int>
#> 1 1910 GER var1 1
#> 2 1910 GER var2 2
#> 3 1910 GER var3 10
#> 4 1910 GER var.4 6
#> 5 1911 GER var1 2
#> 6 1911 GER var2 3
#> 7 1911 GER var3 11
#> 8 1911 GER var.4 7
#> 9 1910 FRA var1 5
#> 10 1910 FRA var2 6
#> 11 1910 FRA var3 8
#> 12 1910 FRA var.4 9
#> 13 1911 FRA var1 1
#> 14 1911 FRA var2 3
#> 15 1911 FRA var3 10
#> 16 1911 FRA var.4 9
means_country <-
data_long %>%
group_by(country) %>%
summarise(mean_country_value = mean(value))
means_years <-
data_long %>%
group_by(year) %>%
summarise(mean_year_value = mean(value))
data %>%
left_join(means_country) %>%
left_join(means_years)
#> Joining, by = "country"
#> Joining, by = "year"
#> # A tibble: 4 x 8
#> year country var1 var2 var3 var.4 mean_country_value mean_year_value
#> <int> <chr> <int> <int> <int> <int> <dbl> <dbl>
#> 1 1910 GER 1 2 10 6 5.25 5.88
#> 2 1911 GER 2 3 11 7 5.25 5.75
#> 3 1910 FRA 5 6 8 9 6.38 5.88
#> 4 1911 FRA 1 3 10 9 6.38 5.75
Created on 2021-11-24 by the reprex package (v2.0.1)
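Note that the joins above pool all variables into a single mean per country and a single mean per year. If you want one year mean per variable, as in the question, a long-format sketch could group by year and name instead and then pivot the statistics back to wide. The object names stats_long, stats_wide and result are made up for illustration, and the data is the same tribble as above.

```r
library(dplyr)
library(tidyr)

data <- tibble::tribble(
  ~year, ~country, ~var1, ~var2, ~var3, ~var.4,
  1910L, "GER", 1L, 2L, 10L, 6L,
  1911L, "GER", 2L, 3L, 11L, 7L,
  1910L, "FRA", 5L, 6L, 8L, 9L,
  1911L, "FRA", 1L, 3L, 10L, 9L
)

# One row per year/country/variable, with the year mean of that variable
stats_long <- data %>%
  pivot_longer(-c(year, country)) %>%
  group_by(year, name) %>%
  mutate(mean = mean(value), relmean = value / mean) %>%
  ungroup()

# Back to wide: columns named mean_<variable> and relmean_<variable>
stats_wide <- stats_long %>%
  select(-value) %>%
  pivot_wider(
    names_from = name,
    values_from = c(mean, relmean),
    names_glue = "{.value}_{name}"
  )

result <- left_join(data, stats_wide, by = c("year", "country"))
result
```

With over 1000 columns in the real data, you would probably restrict pivot_longer to the handful of variables you need (for example with all_of) instead of pivoting everything.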