Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

merge/combine columns with same name but incomplete data

Tags:

I have two data frames that have some columns with the same names and others with different names. The data frames look something like this:

df1       ID hello world hockey soccer     1  1    NA    NA      7      4     2  2    NA    NA      2      5     3  3    10     8      8     23     4  4     4    17      5     12     5  5    NA    NA      3     43  df2           ID hello world football baseball     1  1     2     3       43        6     2  2     5     1       24       32     3  3    NA    NA        2       23     4  4    NA    NA        5       15     5  5     9     7       12       23 

As you can see, in 2 of the shared columns ("hello" and "world"), some of the data is in one of the data frames and the rest is in the other.

What I am trying to do is (1) merge the 2 data frames by "id", (2) combine all the data from the "hello" and "world" columns in both frames into 1 "hello" column and 1 "world" column, and (3) have the final data frame also contain all of the other columns in the 2 original frames ("hockey", "soccer", "football", "baseball"). So, I want the final result to be this:

  ID hello world hockey soccer football baseball 1  1     2     3      7      4        43       6 2  2     5     3      2      5        24      32 3  3    10     8      8     23         2      23 4  4     4    17      5     12         5      15 5  5     9     7      3     43        12      23 

I'm pretty new at R so the only codes I've tried are variations on merge and I've tried the answer I found here, which was based on a similar question: R: merging copies of the same variable. However, my data sets are actually much bigger than what I'm showing here (there's about 20 matching columns (like "hello" and "world") and 100s of non-matching ones (like "hockey" and "football")) so I'm looking for something that won't require me to write them all out manually.

Any idea if this can be done? I'm sorry I can't provide a sample of my efforts, but I really don't know where to start besides:

mydata <- merge(df1, df2, by=c("ID"), all = TRUE) 

To reproduce the data frames:

df1 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hellow = c(2, 5, NA, NA, 9),         world = c(3, 1, NA, NA, 7), football = c(43, 24, 2, 5, 12),         baseball = c(6, 32, 23, 15, 23)), .Names = c("ID", "hello", "world",         "football", "baseball"), class = "data.frame", row.names = c(NA, -5L))   df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hellow = c(NA, NA, 10, 4, NA),         world = c(NA, NA, 8, 17, NA), hockey = c(7, 2, 8, 5, 3),         soccer = c(4, 5, 23, 12, 43)), .Names = c("ID", "hello", "world", "hockey",         "soccer"), class = "data.frame", row.names = c(NA, -5L)) 
like image 798
abclist19 Avatar asked Nov 27 '14 09:11

abclist19


People also ask

What happens if I do not specify the Merge column (s) with on?

If you do not specify the merge column (s) with on, then Pandas will use any columns with the same name as the merge keys. Before getting into the details of how to use merge (), you should first understand the various forms of joins:

How to merge columns of data into one cell without losing data?

Merge columns of data into one cell without losing data by Clipboard. To merge multiple columns into only one cell without losing any data in Excel, you can use the Clipboard to solve the problem easily. 1. First of all, enable the Clipboard by clicking the Anchor button at the bottom-right corner of Clipboard group on the Home tab.

What are the different ways to join two columns in merge ()?

It defaults to 'inner', but other possible options include 'outer', 'left', and 'right'. on: Use this to tell merge () which columns or indices (also called key columns or key indices) you want to join on.

How to merge columns in Excel using kutools?

1 After installing Kutools for Excel, please select the columns of data that you want to merge, and then click Kutools... 2 In the Combine Columns or Rows dialog box, specify the options as follows: More ...


2 Answers

Here's an approach that involves melting your data, merging the molten data, and using dcast to get it back to a wide form. I've added comments to help understand what is going on.

## Required packages library(data.table) library(reshape2)  dcast.data.table(   merge(     ## melt the first data.frame and set the key as ID and variable     setkey(melt(as.data.table(df1), id.vars = "ID"), ID, variable),      ## melt the second data.frame     melt(as.data.table(df2), id.vars = "ID"),      ## you'll have 2 value columns...     all = TRUE)[, value := ifelse(       ## ... combine them into 1 with ifelse       is.na(value.x), value.y, value.x)],    ## This is your reshaping formula   ID ~ variable, value.var = "value") #    ID hello world football baseball hockey soccer # 1:  1     2     3       43        6      7      4 # 2:  2     5     1       24       32      2      5 # 3:  3    10     8        2       23      8     23 # 4:  4     4    17        5       15      5     12 # 5:  5     9     7       12       23      3     43 
like image 75
A5C1D2H2I1M1N2O1R2T1 Avatar answered Sep 27 '22 18:09

A5C1D2H2I1M1N2O1R2T1


Nobody's posted a dplyr solution, so here's a succinct option in dplyr. The approach is simply to do a full_join that combines all rows, then group and summarise to remove the redundant missing cells.

library(tidyverse) df1 <- structure(list(ID = 1:5, hello = c(NA, NA, 10L, 4L, NA), world = c(NA, NA, 8L, 17L, NA), hockey = c(7L, 2L, 8L, 5L, 3L), soccer = c(4L, 5L, 23L, 12L, 43L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), hockey = structure(list(), class = c("collector_integer", "collector")), soccer = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec")) df2 <- structure(list(ID = 1:5, hello = c(2L, 5L, NA, NA, 9L), world = c(3L, 1L, NA, NA, 7L), football = c(43L, 24L, 2L, 5L, 12L), baseball = c(6L, 32L, 23L, 15L, 2L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), football = structure(list(), class = c("collector_integer", "collector")), baseball = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))  df1 %>%   full_join(df2, by = intersect(colnames(df1), colnames(df2))) %>%   group_by(ID) %>%   summarize_all(na.omit) #> # A tibble: 5 x 7 #>      ID hello world hockey soccer football baseball #>   <int> <int> <int>  <int>  <int>    <int>    <int> #> 1     1     2     3      7      4       43        6 #> 2     2     5     1      2      5       24       32 #> 3     3    10     8      8     23        2       23 #> 4     4     4    17      5     12        5       15 #> 5     5     9     7      3     43       12        2 

Created on 2018-07-13 by the reprex package (v0.2.0).

like image 24
thc Avatar answered Sep 27 '22 19:09

thc