dplyr : how-to programmatically full_join dataframes contained in a list of lists?

Context and data structure

I'll share a simplified version of my huge dataset. This simplified version fully respects the structure of my original dataset but contains fewer list elements, dataframes, variables and observations than the original one.

Following the most upvoted answer to the question "How to make a great R reproducible example?", I share my dataset as the output of dput(query1), so that it can be used immediately by copying and pasting the following code block into the R console:

query1 <- structure(list(plu = structure(list(year = structure(list(id = 1:3,
    station = 100:102, pluMean = c(0.509068994778059, 1.92866478959912,
    1.09517453602154), pluMax = c(0.0146962179957886, 0.802984389130343,
    2.48170762478472)), .Names = c("id", "station", "pluMean",
"pluMax"), row.names = c(NA, -3L), class = "data.frame"), month = structure(list(
    id = 1:3, station = 100:102, pluMean = c(0.66493845927034,
    -1.3559338786041, 0.195600637750077), pluMax = c(0.503424623872161,
    0.234402501255681, -0.440264545434053)), .Names = c("id",
"station", "pluMean", "pluMax"), row.names = c(NA, -3L), class = "data.frame"),
    week = structure(list(id = 1:3, station = 100:102, pluMean = c(-0.608295829330578,
    -1.10256919591373, 1.74984007126193), pluMax = c(0.969668266601551,
    0.924426323739882, 3.47460867665884)), .Names = c("id", "station",
    "pluMean", "pluMax"), row.names = c(NA, -3L), class = "data.frame")), .Names = c("year",
"month", "week")), tsa = structure(list(year = structure(list(
    id = 1:3, station = 100:102, tsaMean = c(-1.49060721773042,
    -0.684735418997484, 0.0586655881113975), tsaMax = c(0.25739838787582,
    0.957634817758648, 1.37198023881125)), .Names = c("id", "station",
"tsaMean", "tsaMax"), row.names = c(NA, -3L), class = "data.frame"),
    month = structure(list(id = 1:3, station = 100:102, tsaMean = c(-0.684668662999479,
    -1.28087846387974, -0.600175481941456), tsaMax = c(0.962916941685075,
    0.530773351897188, -0.217143593955998)), .Names = c("id",
    "station", "tsaMean", "tsaMax"), row.names = c(NA, -3L), class = "data.frame"),
    week = structure(list(id = 1:3, station = 100:102, tsaMean = c(0.376481732842365,
    0.370435880636005, -0.105354927593471), tsaMax = c(1.93833635147645,
    0.81176751708868, 0.744932493064975)), .Names = c("id", "station",
    "tsaMean", "tsaMax"), row.names = c(NA, -3L), class = "data.frame")), .Names = c("year",
"month", "week"))), .Names = c("plu", "tsa"))

After executing this, if you execute str(query1), you'll get the structure of my example dataset as :

    > str(query1)
List of 2
 $ plu:List of 3
  ..$ year :'data.frame':   3 obs. of  4 variables:
  .. ..$ id     : int [1:3] 1 2 3
  .. ..$ station: int [1:3] 100 101 102
  .. ..$ pluMean: num [1:3] 0.509 1.929 1.095
  .. ..$ pluMax : num [1:3] 0.0147 0.803 2.4817
  ..$ month:'data.frame':   3 obs. of  4 variables:
  .. ..$ id     : int [1:3] 1 2 3
  .. ..$ station: int [1:3] 100 101 102
  .. ..$ pluMean: num [1:3] 0.665 -1.356 0.196
  .. ..$ pluMax : num [1:3] 0.503 0.234 -0.44
  ..$ week :'data.frame':   3 obs. of  4 variables:
  .. ..$ id     : int [1:3] 1 2 3
  .. ..$ station: int [1:3] 100 101 102
  .. ..$ pluMean: num [1:3] -0.608 -1.103 1.75
  .. ..$ pluMax : num [1:3] 0.97 0.924 3.475
 $ tsa:List of 3
  ..$ year :'data.frame':   3 obs. of  4 variables:
  .. ..$ id     : int [1:3] 1 2 3
  .. ..$ station: int [1:3] 100 101 102
  .. ..$ tsaMean: num [1:3] -1.4906 -0.6847 0.0587
  .. ..$ tsaMax : num [1:3] 0.257 0.958 1.372
  ..$ month:'data.frame':   3 obs. of  4 variables:
  .. ..$ id     : int [1:3] 1 2 3
  .. ..$ station: int [1:3] 100 101 102
  .. ..$ tsaMean: num [1:3] -0.685 -1.281 -0.6
  .. ..$ tsaMax : num [1:3] 0.963 0.531 -0.217
  ..$ week :'data.frame':   3 obs. of  4 variables:
  .. ..$ id     : int [1:3] 1 2 3
  .. ..$ station: int [1:3] 100 101 102
  .. ..$ tsaMean: num [1:3] 0.376 0.37 -0.105
  .. ..$ tsaMax : num [1:3] 1.938 0.812 0.745

So how does it read? I have a big list (query1) made of 2 parameter elements (plu & tsa). Each of these is itself a list made of 3 elements (year, month, week), and each of those is a timeInterval dataframe with 4 columns (id, station, a mean and a max, e.g. pluMean and pluMax) and exactly the same number of observations (3).

What I want to achieve

I want to programmatically full_join, by id & station, all the timeInterval dataframes that have the same name (year, month, week). This means that I should end up with a new list (query1Changed) containing 3 dataframes (year, month, week), each of them containing 6 columns (id, station, pluMean, pluMax, tsaMean, tsaMax) and 3 observations. Schematically, I need to arrange the data as follows:

do a full_join by station and id of :

  • df query1$plu$year with df query1$tsa$year
  • df query1$plu$month with df query1$tsa$month
  • df query1$plu$week with df query1$tsa$week

Or expressed with another representation :

  • df query1[[1]][[1]] with df query1[[2]][[1]]
  • df query1[[1]][[2]] with df query1[[2]][[2]]
  • df query1[[1]][[3]] with df query1[[2]][[3]]

And expressed programmatically (n being the total number of elements of the big list) :

  • df query1[[i]][[1]] with df query1[[i+1]][[1]] ... with df query1[[n]][[1]]
  • df query1[[i]][[2]] with df query1[[i+1]][[2]] ... with df query1[[n]][[2]]
  • df query1[[i]][[3]] with df query1[[i+1]][[3]] ... with df query1[[n]][[3]]

I need to achieve this programmatically because in my real project I could encounter another big list with more than 2 parameter elements and more than 4 columns in each of their timeInterval dataframes.

What will always remain the same in my analysis is this: all the parameter elements of a big list will have the same number of timeInterval dataframes with the same names, and each of these timeInterval dataframes will have the same number of observations and will share 2 columns with exactly the same names and values (id & station).
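These invariants could be checked before attempting any join. The following is only a minimal sketch in base R: check_structure is a hypothetical helper name (not part of dplyr or purrr), and toy is a made-up stand-in for query1 with the same shape.

```r
# Hypothetical helper: TRUE when every parameter element has the same
# timeInterval names and every dataframe contains the key columns.
check_structure <- function(big_list, keys = c("id", "station")) {
  interval_names <- names(big_list[[1]])
  same_names <- all(vapply(
    big_list,
    function(p) identical(names(p), interval_names),
    logical(1)
  ))
  has_keys <- all(vapply(
    big_list,
    function(p) all(vapply(p, function(df) all(keys %in% names(df)), logical(1))),
    logical(1)
  ))
  same_names && has_keys
}

# Made-up data with the same shape as query1 (values are illustrative only).
toy <- list(
  plu = list(year = data.frame(id = 1:2, station = 100:101, pluMean = c(0.5, 1.9)),
             week = data.frame(id = 1:2, station = 100:101, pluMean = c(-0.6, -1.1))),
  tsa = list(year = data.frame(id = 1:2, station = 100:101, tsaMean = c(-1.5, -0.7)),
             week = data.frame(id = 1:2, station = 100:101, tsaMean = c(0.4, 0.4)))
)

structure_ok <- check_structure(toy)
```

It does not compare the values of id and station across dataframes; that part of the assumption would need an extra check.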

What I have managed so far

Executing the following piece of code :

> query1Changed <- do.call(function(...) mapply(bind_cols, ..., SIMPLIFY=F), args = query1)

arranges the data as expected. However this is not a neat solution since we end up with repeated column names (id & station) :

> str(query1Changed)
List of 3
 $ year :'data.frame':  3 obs. of  8 variables:
  ..$ id      : int [1:3] 1 2 3
  ..$ station : int [1:3] 100 101 102
  ..$ pluMean : num [1:3] 0.509 1.929 1.095
  ..$ pluMax  : num [1:3] 0.0147 0.803 2.4817
  ..$ id1     : int [1:3] 1 2 3
  ..$ station1: int [1:3] 100 101 102
  ..$ tsaMean : num [1:3] -1.4906 -0.6847 0.0587
  ..$ tsaMax  : num [1:3] 0.257 0.958 1.372
 $ month:'data.frame':  3 obs. of  8 variables:
  ..$ id      : int [1:3] 1 2 3
  ..$ station : int [1:3] 100 101 102
  ..$ pluMean : num [1:3] 0.665 -1.356 0.196
  ..$ pluMax  : num [1:3] 0.503 0.234 -0.44
  ..$ id1     : int [1:3] 1 2 3
  ..$ station1: int [1:3] 100 101 102
  ..$ tsaMean : num [1:3] -0.685 -1.281 -0.6
  ..$ tsaMax  : num [1:3] 0.963 0.531 -0.217
 $ week :'data.frame':  3 obs. of  8 variables:
  ..$ id      : int [1:3] 1 2 3
  ..$ station : int [1:3] 100 101 102
  ..$ pluMean : num [1:3] -0.608 -1.103 1.75
  ..$ pluMax  : num [1:3] 0.97 0.924 3.475
  ..$ id1     : int [1:3] 1 2 3
  ..$ station1: int [1:3] 100 101 102
  ..$ tsaMean : num [1:3] 0.376 0.37 -0.105
  ..$ tsaMax  : num [1:3] 1.938 0.812 0.745

We could add a second process to "clean" the data but this would not be the most efficient solution. So I don't want to use this workaround.

Next, I've tried doing the same using dplyr full_join but with no success. Executing the following code :

> query1Changed <- do.call(function(...) mapply(full_join(..., by = c("station", "id")), ..., SIMPLIFY=F), args = query1)

returns the following error :

Error in UseMethod("full_join") :
  no applicable method for 'full_join' applied to an object of class "list"

So, how should I write my full_join expression to make it run on the dataframes? Or is there another way to perform my data transformation efficiently?

What I've found on the web that could help

I've found the related questions but I still can't figure out how to adapt their solutions to my problem.

On Stack Overflow:

  • Merging a data frame from a list of data frames [duplicate]
  • Simultaneously merge multiple data.frames in a list
  • Joining list of data.frames from map() call
  • Combining elements of list of lists by index

On blogs:

  • Joining a List of Data Frames with purrr::reduce()

Any help would be greatly appreciated. I hope I've described my problem clearly. I started programming with R only 2 months ago, so please be indulgent if the solution is obvious ;)

pokyah asked Aug 30 '17 14:08


1 Answer

First of all, thanks for posting a really great description of what your problem is and which requirements you need for your solution.

I'd start by using purrr::map2 to write a function that takes two lists of data frames and joins them in parallel. That is, it joins the first data frame of plu with the first of tsa, and so on up to the last of plu with the last of tsa, and returns the results as a list.

> join_each = function(x, y) map2(x, y, full_join)
> join_each(query1$plu, query1$tsa)
Joining, by = c("id", "station")
Joining, by = c("id", "station")
Joining, by = c("id", "station")
$year
  id station  pluMean     pluMax     tsaMean    tsaMax
1  1     100 0.509069 0.01469622 -1.49060722 0.2573984
2  2     101 1.928665 0.80298439 -0.68473542 0.9576348
3  3     102 1.095175 2.48170762  0.05866559 1.3719802

$month
  id station    pluMean     pluMax    tsaMean     tsaMax
1  1     100  0.6649385  0.5034246 -0.6846687  0.9629169
2  2     101 -1.3559339  0.2344025 -1.2808785  0.5307734
3  3     102  0.1956006 -0.4402645 -0.6001755 -0.2171436

$week
  id station    pluMean    pluMax    tsaMean    tsaMax
1  1     100 -0.6082958 0.9696683  0.3764817 1.9383364
2  2     101 -1.1025692 0.9244263  0.3704359 0.8117675
3  3     102  1.7498401 3.4746087 -0.1053549 0.7449325

Well, this works when there are only two of them, but you want it to work when there are n lists of data.frames. Now you are going to need purrr::reduce:

> reduce(query1, join_each)
Joining, by = c("id", "station")
Joining, by = c("id", "station")
Joining, by = c("id", "station")
$year
  id station  pluMean     pluMax     tsaMean    tsaMax
1  1     100 0.509069 0.01469622 -1.49060722 0.2573984
2  2     101 1.928665 0.80298439 -0.68473542 0.9576348
3  3     102 1.095175 2.48170762  0.05866559 1.3719802

$month
  id station    pluMean     pluMax    tsaMean     tsaMax
1  1     100  0.6649385  0.5034246 -0.6846687  0.9629169
2  2     101 -1.3559339  0.2344025 -1.2808785  0.5307734
3  3     102  0.1956006 -0.4402645 -0.6001755 -0.2171436

$week
  id station    pluMean    pluMax    tsaMean    tsaMax
1  1     100 -0.6082958 0.9696683  0.3764817 1.9383364
2  2     101 -1.1025692 0.9244263  0.3704359 0.8117675
3  3     102  1.7498401 3.4746087 -0.1053549 0.7449325

It computes join_each(query1[[1]], query1[[2]]) %>% join_each(query1[[3]]) ... %>% join_each(query1[[n]]).
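Here is a self-contained sketch of the same idea with three parameter lists and the join keys spelled out, which also avoids the "Joining, by" messages. The toy data (including the hum element) is made up for illustration and is not part of your dataset; it assumes dplyr and purrr are installed.

```r
library(dplyr)
library(purrr)

# Made-up stand-in for query1 with three parameter elements.
toy <- list(
  plu = list(year = data.frame(id = 1:2, station = 100:101, pluMean = c(0.5, 1.9)),
             week = data.frame(id = 1:2, station = 100:101, pluMean = c(-0.6, -1.1))),
  tsa = list(year = data.frame(id = 1:2, station = 100:101, tsaMean = c(-1.5, -0.7)),
             week = data.frame(id = 1:2, station = 100:101, tsaMean = c(0.4, 0.4))),
  hum = list(year = data.frame(id = 1:2, station = 100:101, humMean = c(70, 65)),
             week = data.frame(id = 1:2, station = 100:101, humMean = c(72, 68)))
)

# Join two lists of data frames element by element, with explicit keys.
join_each <- function(x, y) {
  map2(x, y, function(a, b) full_join(a, b, by = c("id", "station")))
}

# Folds join_each over the big list: ((plu joined with tsa) joined with hum).
joined <- reduce(toy, join_each)
```

Naming the keys explicitly makes the join fail loudly if one of the data frames is missing id or station, instead of silently joining on whatever columns happen to be shared.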

Update: The following one-liner does the same: reduce(query1, map2, full_join). It isn't as readable, though.
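If you would rather not depend on purrr, the same fold can probably be written in base R with Reduce, Map and merge. This is an alternative I am swapping in, not what the code above does internally: merge(..., all = TRUE) is base R's full outer join and it sorts the result by the key columns. The toy data is made up for illustration.

```r
# Made-up stand-in for query1 (values are illustrative only).
toy <- list(
  plu = list(year = data.frame(id = 1:2, station = 100:101, pluMean = c(0.5, 1.9)),
             week = data.frame(id = 1:2, station = 100:101, pluMean = c(-0.6, -1.1))),
  tsa = list(year = data.frame(id = 1:2, station = 100:101, tsaMean = c(-1.5, -0.7)),
             week = data.frame(id = 1:2, station = 100:101, tsaMean = c(0.4, 0.4)))
)

# Full outer join of two data frames on the shared key columns.
merge_pair <- function(a, b) merge(a, b, by = c("id", "station"), all = TRUE)

# Map pairs up the data frames by position; Reduce folds over the big list.
joined_base <- Reduce(function(x, y) Map(merge_pair, x, y), toy)
```

Map keeps the names of its first list argument, so the result is still indexed by year and week.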

Luiz Rodrigo answered Oct 30 '22 09:10