Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The most effective way to merge/combine two data sets by overlapping row.names and mean values

Tags:

r

I would like to find the most effective way for combining two data frames and average the values in the columns with different row.names . So, I would like to take jsut overlapping row.names from both data and combine them into one. The values from columns should be avaraged by mean. The example datas:

mtcars <- 
structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 
24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 
30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 
19.7, 15, 21.4), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 
8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4), 
    disp = c(160, 160, 108, 258, 360, 225, 360, 146.7, 140.8, 
    167.6, 167.6, 275.8, 275.8, 275.8, 472, 460, 440, 78.7, 75.7, 
    71.1, 120.1, 318, 304, 350, 400, 79, 120.3, 95.1, 351, 145, 
    301, 121), hp = c(110, 110, 93, 110, 175, 105, 245, 62, 95, 
    123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65, 97, 150, 
    150, 245, 175, 66, 91, 113, 264, 175, 335, 109), drat = c(3.9, 
    3.9, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 
    3.07, 3.07, 3.07, 2.93, 3, 3.23, 4.08, 4.93, 4.22, 3.7, 2.76, 
    3.15, 3.73, 3.08, 4.08, 4.43, 3.77, 4.22, 3.62, 3.54, 4.11
    ), wt = c(2.62, 2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 
    3.15, 3.44, 3.44, 4.07, 3.73, 3.78, 5.25, 5.424, 5.345, 2.2, 
    1.615, 1.835, 2.465, 3.52, 3.435, 3.84, 3.845, 1.935, 2.14, 
    1.513, 3.17, 2.77, 3.57, 2.78), qsec = c(16.46, 17.02, 18.61, 
    19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3, 18.9, 17.4, 17.6, 
    18, 17.98, 17.82, 17.42, 19.47, 18.52, 19.9, 20.01, 16.87, 
    17.3, 15.41, 17.05, 18.9, 16.7, 16.9, 14.5, 15.5, 14.6, 18.6
    ), vs = c(0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 
    0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1), am = c(1, 
    1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 
    0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1), gear = c(4, 4, 4, 3, 
    3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 
    3, 3, 4, 5, 5, 5, 5, 5, 4), carb = c(4, 4, 1, 1, 2, 1, 4, 
    2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 
    2, 2, 4, 6, 8, 2)), .Names = c("mpg", "cyl", "disp", "hp", 
"drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4", 
"Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", 
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280", 
"Merc 280C", "Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood", 
"Lincoln Continental", "Chrysler Imperial", "Fiat 128", "Honda Civic", 
"Toyota Corolla", "Toyota Corona", "Dodge Challenger", "AMC Javelin", 
"Camaro Z28", "Pontiac Firebird", "Fiat X1-9", "Porsche 914-2", 
"Lotus Europa", "Ford Pantera L", "Ferrari Dino", "Maserati Bora", 
"Volvo 142E"), class = "data.frame")

Second data:

mtcars11 <- 
structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 
24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 
30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 
19.7), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 
8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6), disp = c(160, 160, 
108, 258, 360, 225, 360, 146.7, 140.8, 167.6, 167.6, 275.8, 275.8, 
275.8, 472, 460, 440, 78.7, 75.7, 71.1, 120.1, 318, 304, 350, 
400, 79, 120.3, 95.1, 351, 145), hp = c(110, 110, 93, 110, 175, 
105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, 66, 
52, 65, 97, 150, 150, 245, 175, 66, 91, 113, 264, 175), drat = c(3.9, 
3.9, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 
3.07, 3.07, 2.93, 3, 3.23, 4.08, 4.93, 4.22, 3.7, 2.76, 3.15, 
3.73, 3.08, 4.08, 4.43, 3.77, 4.22, 3.62), wt = c(2.62, 2.875, 
2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44, 3.44, 4.07, 
3.73, 3.78, 5.25, 5.424, 5.345, 2.2, 1.615, 1.835, 2.465, 3.52, 
3.435, 3.84, 3.845, 1.935, 2.14, 1.513, 3.17, 2.77), qsec = c(16.46, 
17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3, 18.9, 
17.4, 17.6, 18, 17.98, 17.82, 17.42, 19.47, 18.52, 19.9, 20.01, 
16.87, 17.3, 15.41, 17.05, 18.9, 16.7, 16.9, 14.5, 15.5), vs = c(0, 
0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 
0, 0, 0, 1, 0, 1, 0, 0), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1), 
    gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 
    3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5), carb = c(4, 4, 
    1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 
    2, 2, 4, 2, 1, 2, 2, 4, 6)), .Names = c("mpg", "cyl", "disp", 
"hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4", 
"Chrysler", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", 
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280", 
"Merc 280C", "Merc 450SE", "Nexia", "Merc 450SLC", "Cadillac Fleetwood", 
"Lincoln Continental", "Chrysler Imperial", "Fiat 128", "Honda Civic", 
"Toyota Corolla", "Toyota Corona", "Dodge Challenger", "AMC Javelin", 
"Camaro Z28", "Pontiac Firebirda", "Punto", "Porsche 914-2", 
"Lotus Europa", "Ford Pantera T", "Ferrari Dino"), class = "data.frame")

So the solution which came to my mind is that (the long one):

vec_names_mt <- row.names(mtcars) ## so we the row.names from first data
vec_names_mt11 <- row.names(mtcars11) ## so we the row.names from second data

vec_inter <- intersect(vec_names_mt, vec_names_mt11) ## find overlapping names

data_mt <- mtcars[row.names(mtcars) %in% vec_inter, ] ## take the rows from first data which overlaps
data_mt11 <- mtcars11[row.names(mtcars11) %in% vec_inter, ] ## take the rows from second data which overlaps

How can we combine them and average the values ? Any idea how to do that in the simplest way ?

like image 906
Shaxi Liver Avatar asked Apr 16 '15 12:04

Shaxi Liver


1 Answers

Assuming d1 and d2 are your data.frames, here's how I'd approach it. You'll have to use development version of data.table (v1.9.5) though, for mget to work.

require(data.table) # v1.9.5

setkey(setDT(d1, keep.rownames=TRUE), rn)
setkey(setDT(d2, keep.rownames=TRUE), rn)

xcols = names(d1)[-1L]
icols = paste("i.", xcols, sep="")

foo <- function(a, b) mean(c(a, b), na.rm=TRUE)
d1[d2, Map(foo, mget(xcols), mget(icols)), by=.EACHI, nomatch=0L]

We first convert the data.frames to data.tables by reference using setDT, and convert row names to a new column (which will automatically be named rn), and set key on that column.

setkey() reorders a data.table by the columns specified, and marks those columns as key columns, which will help us perform a join (on those key columns).

In data.tables, joins can be accomplished by using the x[i] notation as well as merge() function (there's a data.table method implemented), but x[i] is much more powerful and flexible. The syntax x[i] joins each row of i to matching rows in x (on the key columns).

So, d1[d2] would return for each row in d2 the matching rows in d1, along with all the other columns in d2.

d1[d2, nomatch=0L] is the equivalent of an inner join, where only rows that matches are returned.

d1[d2, Map(foo, mget(xcols), mget(icols)), by=.EACHI, nomatch=0L] evaluates the expression in j = Map(...), for each row in d2 - hence by = .EACHI.

To sum things up, for each row in d2, find the matching rows in d1. Extract the columns specified in xcols and icols just for that matching rows, and apply the function foo() which will concatenate the vectors and take their mean(). And do this for each row of d2 (by = .EACHI). Ignore rows in d2 that doesn't have any matches in d1 on key column (nomatch=0L).

Hope this helps.

like image 57
Arun Avatar answered Sep 30 '22 23:09

Arun