Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Multiply pairs of columns using dplyr in R

Tags:

r

dplyr

I have a dataframe with crime data and associated "prices", organized by country and year (although I don't think this is important here). Here is a subset of my data:

> crime
# A tibble: 8 x 8
  iso    year  theft robbery burglary theft_price robbery_price burglary_price
  <chr> <dbl>  <dbl>   <dbl>    <dbl>       <dbl>         <dbl>          <dbl>
1 ALB    2003   3694     199      874        32.9          115           49.3
2 ALB    2004   3694     199      874        38.2          134           57.3
3 ALB    2005   3694     199      874        42.8          150           64.2
4 ALB    2006   3450     164      779        47.0          165           70.5
5 AUS    2003 722334   14634   586266       408.4         1427          612.4 
6 AUS    2004 636717   14634   512551       481.3         1683          721.2 
7 AUS    2005 598700   14634   468558       536.7         1877          804.5 
8 AUS    2006 594111   14634   433974       564.8         1973          846.5 

I want to create new columns that contain the product of each crime type with its price, so theft x theft_price = theft_prod, etc. In my actual dataset I have more crime types so I need something that is scalable to more variables than this subset contains.

I like the syntax of the dplyr package, so I to use something like this, but I cannot find the solution. I think it is not possible to reference other variables than the ones in vars(). Correct?

crime %>%
  mutate_at(vars(theft, robbery, burglary),
            funs(prod = . * ????))

Thanks.

like image 592
avs Avatar asked Aug 15 '18 20:08

avs


2 Answers

Use dplyr and tidyr:

library(dplyr); library(tidyr);

df %>% 
    gather(crime, value, -iso, -year) %>% 
    separate(crime, c('crime', 'type'), sep='_', fill = 'right') %>% 
    replace_na(list(type = 'amount')) %>% 
    spread(type, value) %>% 
    transmute(
        iso = iso, year = year, 
        crime = paste(crime, 'prod', sep = '_'), 
        prod = amount * price
    ) %>% 
    spread(crime, prod)

#  iso year burglary_prod robbery_prod  theft_prod
#1 ALB 2003       43088.2        22885    121532.6
#2 ALB 2004       50080.2        26666    141110.8
#3 ALB 2005       56110.8        29850    158103.2
#4 ALB 2006       54919.5        27060    162150.0
#5 AUS 2003   359029298.4     20882718 295001205.6
#6 AUS 2004   369651781.2     24629022 306451892.1
#7 AUS 2005   376954911.0     27468018 321322290.0
#8 AUS 2006   367358991.0     28872882 335553892.8

Another option without data reshaping, assuming the columns' names follow the crime_price convention:

library(tidyverse)
# find out the crimes columns
crimes = grep('^(?!.*_price$)', names(df)[-c(1,2)], perl = T, value = T)
# construct the crimes prices columns
crimes_prices = paste(crimes, 'price', sep = '_')
crimes_prod = paste(crimes, 'prod', sep = '_')

# loop through crime and crime price columns and multiply them
map2(crimes, crimes_prices, ~ df[[.x]] * df[[.y]]) %>% 
    set_names(crimes_prod) %>% 
    as_tibble() %>% 
    bind_cols(select(df, iso, year))

# A tibble: 8 x 5
#  theft_prod robbery_prod burglary_prod iso    year
#       <dbl>        <int>         <dbl> <fct> <int>
#1    121533.        22885        43088. ALB    2003
#2    141111.        26666        50080. ALB    2004
#3    158103.        29850        56111. ALB    2005
#4    162150         27060        54920. ALB    2006
#5 295001206.     20882718    359029298. AUS    2003
#6 306451892.     24629022    369651781. AUS    2004
#7 321322290      27468018    376954911  AUS    2005
#8 335553893.     28872882    367358991  AUS    2006
like image 74
Psidom Avatar answered Oct 21 '22 05:10

Psidom


Doing this kind of manipulation in the tidyverse is best done by making sure your data is tidy by reshaping it. A purrr approach is also possible but is likely reliant on the order of your columns, which might not always be reliable. Instead, you can do the following:

  1. gather up all your measure columns
  2. mutate a new column measure_type that indicates whether it is a count or price, and remove the _price from crime_type. Now we have separate columns for the type of crime and the metric we are using for that crime. Each row is a single iso-year-crime-metric combination.
  3. spread the crime types back out so now we have separate count and price columns for all crimes, and then multiply with mutate.
  4. (optional) if you want to put it back in your wide format, we just gather up count and price and our new product column, unite to combine with the crime type and spread back out.
library(tidyverse)
tbl <- read_table2(
"iso    year  theft robbery burglary theft_price robbery_price burglary_price
ALB    2003   3694     199      874        32.9          115           49.3
ALB    2004   3694     199      874        38.2          134           57.3
ALB    2005   3694     199      874        42.8          150           64.2
ALB    2006   3450     164      779        47.0          165           70.5
AUS    2003 722334   14634   586266       408.4         1427          612.4
AUS    2004 636717   14634   512551       481.3         1683          721.2
AUS    2005 598700   14634   468558       536.7         1877          804.5
AUS    2006 594111   14634   433974       564.8         1973          846.5"
)
tidy_tbl <- tbl %>%
  gather(crime_type, measure, -iso, - year) %>%
  mutate(
    measure_type = if_else(str_detect(crime_type, "_price$"), "price", "count"),
    crime_type = str_remove(crime_type, "_price")
    ) %>%
  spread(measure_type, measure) %>%
  mutate(product = count * price)
tidy_tbl
#> # A tibble: 24 x 6
#>    iso    year crime_type count price product
#>    <chr> <int> <chr>      <dbl> <dbl>   <dbl>
#>  1 ALB    2003 burglary     874  49.3  43088.
#>  2 ALB    2003 robbery      199 115    22885 
#>  3 ALB    2003 theft       3694  32.9 121533.
#>  4 ALB    2004 burglary     874  57.3  50080.
#>  5 ALB    2004 robbery      199 134    26666 
#>  6 ALB    2004 theft       3694  38.2 141111.
#>  7 ALB    2005 burglary     874  64.2  56111.
#>  8 ALB    2005 robbery      199 150    29850 
#>  9 ALB    2005 theft       3694  42.8 158103.
#> 10 ALB    2006 burglary     779  70.5  54920.
#> # ... with 14 more rows

tidy_tbl %>%
  gather(measure_type, measure, count:product) %>% 
  unite("colname", crime_type, measure_type) %>%
  spread(colname, measure)
#> # A tibble: 8 x 11
#>   iso    year burglary_count burglary_price burglary_product robbery_count
#>   <chr> <int>          <dbl>          <dbl>            <dbl>         <dbl>
#> 1 ALB    2003            874           49.3           43088.           199
#> 2 ALB    2004            874           57.3           50080.           199
#> 3 ALB    2005            874           64.2           56111.           199
#> 4 ALB    2006            779           70.5           54920.           164
#> 5 AUS    2003         586266          612.        359029298.         14634
#> 6 AUS    2004         512551          721.        369651781.         14634
#> 7 AUS    2005         468558          804.        376954911          14634
#> 8 AUS    2006         433974          846.        367358991          14634
#> # ... with 5 more variables: robbery_price <dbl>, robbery_product <dbl>,
#> #   theft_count <dbl>, theft_price <dbl>, theft_product <dbl>

Created on 2018-08-15 by the reprex package (v0.2.0).

like image 21
Calum You Avatar answered Oct 21 '22 04:10

Calum You