Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

melt dataframe - multiple columns - "Enhanced (new) functionality from data.tables"

Tags:

r

data.table

melt

Update: I should have been clearer that I was trying to check out the Enhanced functionality in reshaping using data.tables https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html. Updated the title.

I have this data set with two sets of variables - Credit_Risk_Capital and Name_concentration. They are calculated per 2 methodologies - New and Old. When I melt them using the data.table package, the variable names default to 1 and 2. How can I change them to just Credit_Risk_Capital and Name_Concentration.

Here is the data set

    df <-data.table (id = c(1:100),Credit_risk_Capital_old= rnorm(100, mean = 400, sd = 60),
             NameConcentration_old= rnorm(100, mean = 100, sd = 10),
             Credit_risk_Capital_New =rnorm(100, mean = 200, sd = 10),
             NameConcentration_New = rnorm(100, mean = 40, sd = 10))
    old <- c('Credit_risk_Capital_old','NameConcentration_old')
   new<-c('Credit_risk_Capital_New','NameConcentration_New')
  t1<-melt(df, measure.vars = list(old,new), variable.name = "CapitalChargeType",value.name = c("old","new"))

Now instead of the elements in the CapitalChargeType Column getting tagged as 1's and 2's, I want them to be changed to Credit_risk_Capital and NameConcentration. I can obviously change them in a subsequent step using a 'match' function, but is there anyway I can do it within melt itself.

like image 346
ashleych Avatar asked Sep 02 '25 02:09

ashleych


1 Answers

The issue here is that melt() doesn't know how to name the variables in case of more than one measure variable. So, it resorts to simply numbering the variables.

David has pointed out that there is a feature request. However, I will show two workarounds and compare them (plus the tidyr answer) in terms of speed.

  1. The first approach is to melt() all measure variables (which keeps the variable names), create new variable names, and dcast() the temporary result again to end up with two value columns. This recast approach is also being used by austensen.
  2. The second approach is what the OP is asking for (melting two value columns simultaneously) but includes a simple way to rename the variables afterwards.

Recast

library(data.table)   # CRAN version 1.10.4 used
# melt all measure variables
long <- melt(df, id.vars = "id")
# split variables names
long[, c("CapitalChargeType", "age") := 
       tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)] 
dcast(long, id + CapitalChargeType ~ age)
      id   CapitalChargeType       New       old
  1:   1 Credit_risk_Capital 204.85227 327.57606
  2:   1   NameConcentration  34.20043 104.14524
  3:   2 Credit_risk_Capital 206.96769 416.64575
  4:   2   NameConcentration  30.46721  95.25282
  5:   3 Credit_risk_Capital 201.85514 465.06647
 ---                                            
196:  98   NameConcentration  45.38833  90.34097
197:  99 Credit_risk_Capital 203.53625 458.37501
198:  99   NameConcentration  40.14643 101.62655
199: 100 Credit_risk_Capital 203.19156 527.26703
200: 100   NameConcentration  30.83511  79.21762

Note that the variable names are split at the last _ before the final old or New, resp. This is achieved by using a regular expression with positive look-ahead: "_(?=(New|old)$)"

Melting two columns and renaming variables

Here, we pick-up David's suggestion to use the patterns() function which is equivalent to specify a list of measure variables.

As a side note: The order of the list (or the patterns) determines the order of the value columns:

melt(df, measure.vars = patterns("New$", "old$"))
      id variable    value1    value2
  1:   1        1 204.85227 327.57606
  2:   2        1 206.96769 416.64575
  3:   3        1 201.85514 465.06647
  ...
melt(df, measure.vars = patterns("old$", "New$"))
      id variable    value1    value2
  1:   1        1 327.57606 204.85227
  2:   2        1 416.64575 206.96769
  3:   3        1 465.06647 201.85514
  ...

As already pointed out by the OP, melting with multiple measure variables

long <- melt(df, measure.vars = patterns("old$", "New$"), 
     variable.name = "CapitalChargeType",
     value.name = c("old", "New")) 

returns numbers instead of the variable names:

str(long)
Classes ‘data.table’ and 'data.frame':    200 obs. of  4 variables:
 $ id               : int  1 2 3 4 5 6 7 8 9 10 ...
 $ CapitalChargeType: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ old              : num  328 417 465 259 426 ...
 $ New              : num  205 207 202 207 203 ...
 - attr(*, ".internal.selfref")=<externalptr>

Fortunately, these are factors which can be changed easily by replacing the factor levels with help of the forcats package:

long[, CapitalChargeType := forcats::lvls_revalue(
  CapitalChargeType, 
  c("Credit_risk_Capital", "NameConcentration"))]
long[order(id)]
      id   CapitalChargeType       old       New
  1:   1 Credit_risk_Capital 327.57606 204.85227
  2:   1   NameConcentration 104.14524  34.20043
  3:   2 Credit_risk_Capital 416.64575 206.96769
  4:   2   NameConcentration  95.25282  30.46721
  5:   3 Credit_risk_Capital 465.06647 201.85514
 ---                                            
196:  98   NameConcentration  90.34097  45.38833
197:  99 Credit_risk_Capital 458.37501 203.53625
198:  99   NameConcentration 101.62655  40.14643
199: 100 Credit_risk_Capital 527.26703 203.19156
200: 100   NameConcentration  79.21762  30.83511

Note that melt() numbers the variables in the order the columns appear in df.

reshape()

The stats package of base R has a reshape() function. Unfortunately, it doesn't accept regular expressions with positive look-ahead. So, the automatic guessing of variable names can't be used. Instead, all relevant parameters have to be specified explicitely:

old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
reshape(df, varying = list(old, new), direction = "long", 
        timevar = "CapitalChargeType",
        times = c("Credit_risk_Capital", "NameConcentration"),
        v.names = c("old", "New"))
      id   CapitalChargeType       old       New
  1:   1 Credit_risk_Capital 367.95567 194.93598
  2:   2 Credit_risk_Capital 467.98061 215.39663
  3:   3 Credit_risk_Capital 363.75586 201.72794
  4:   4 Credit_risk_Capital 433.45070 191.64176
  5:   5 Credit_risk_Capital 408.55776 193.44071
 ---                                            
196:  96   NameConcentration  93.67931  47.85263
197:  97   NameConcentration 101.32361  46.94047
198:  98   NameConcentration 104.80926  33.67270
199:  99   NameConcentration 101.33178  32.28041
200: 100   NameConcentration  85.37136  63.57817

Benchmarking

The benchmark includes all 4 approaches discussed so far:

  • tidyr with modifications to use resgular expressions with positive look-ahead,
  • recast,
  • melt() of multiple value variables, and
  • reshape().

The benchmark data consist of 100 K rows:

n_rows <- 100L
set.seed(1234L)
df <- data.table(
  id = c(1:n_rows),
  Credit_risk_Capital_old = rnorm(n_rows, mean = 400, sd = 60),
  NameConcentration_old = rnorm(n_rows, mean = 100, sd = 10),
  Credit_risk_Capital_New = rnorm(n_rows, mean = 200, sd = 10),
  NameConcentration_New = rnorm(n_rows, mean = 40, sd = 10))

For benchmarking, the microbenchmark package is used:

library(magrittr)
old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
microbenchmark::microbenchmark(
  tidyr = {
    r_tidyr <- df %>% 
      dplyr::as_data_frame() %>%  
      tidyr::gather("key", "value", -id) %>% 
      tidyr::separate(key, c("CapitalChargeType", "age"), sep = "_(?=(New|old)$)") %>% 
      tidyr::spread(age, value)
  },
  recast = {
    r_recast <- dcast(
      melt(df, id.vars = "id")[
        , c("CapitalChargeType", "age") := 
          tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)], 
      id + CapitalChargeType ~ age)
  },
  m2col = {
    r_m2col <- melt(df, measure.vars = patterns("New$", "old$"), 
                    variable.name = "CapitalChargeType",
                    value.name = c("New", "old"))[
                      , CapitalChargeType := forcats::lvls_revalue(
                        CapitalChargeType, 
                        c("Credit_risk_Capital", "NameConcentration"))][order(id)]
  },
  reshape = {
    r_reshape <- reshape(df, varying = list(new, old), direction = "long", 
                         timevar = "CapitalChargeType",
                         times = c("Credit_risk_Capital", "NameConcentration"),
                         v.names = c("New", "old")
    )
  },
  times = 10L
)
Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval
   tidyr 705.20364 789.63010 832.11391 813.08830 825.15259 1091.3188    10
  recast 215.35813 223.60715 287.28034 261.23333 338.36813  477.3355    10
   m2col  10.28721  11.35237  38.72393  14.46307  23.64113  154.3357    10
 reshape 143.75546 171.68592 379.05752 224.13671 269.95301 1730.5892    10

The timings show that melt() of two columns simultaneously is about 15 times faster than the second fastest, reshape(). Both recast variants fall behind because they both require two reshaping operations. The tidyr solution is particularly slow.

like image 108
Uwe Avatar answered Sep 06 '25 13:09

Uwe