
merge/combine columns with same name but incomplete data


I have two data frames that have some columns with the same names and others with different names. The data frames look something like this:

    df1
      ID hello world hockey soccer
    1  1    NA    NA      7      4
    2  2    NA    NA      2      5
    3  3    10     8      8     23
    4  4     4    17      5     12
    5  5    NA    NA      3     43

    df2
      ID hello world football baseball
    1  1     2     3       43        6
    2  2     5     1       24       32
    3  3    NA    NA        2       23
    4  4    NA    NA        5       15
    5  5     9     7       12       23

As you can see, in 2 of the shared columns ("hello" and "world"), some of the data is in one of the data frames and the rest is in the other.

What I am trying to do is (1) merge the 2 data frames by "ID", (2) combine all the data from the "hello" and "world" columns in both frames into 1 "hello" column and 1 "world" column, and (3) have the final data frame also contain all of the other columns in the 2 original frames ("hockey", "soccer", "football", "baseball"). So, I want the final result to be this:

      ID hello world hockey soccer football baseball
    1  1     2     3      7      4       43        6
    2  2     5     1      2      5       24       32
    3  3    10     8      8     23        2       23
    4  4     4    17      5     12        5       15
    5  5     9     7      3     43       12       23

I'm pretty new to R, so the only things I've tried are variations on merge and the answer to a similar question, "R: merging copies of the same variable". However, my data sets are actually much bigger than what I'm showing here: there are about 20 matching columns (like "hello" and "world") and hundreds of non-matching ones (like "hockey" and "football"), so I'm looking for something that won't require me to write them all out manually.

Any idea if this can be done? I'm sorry I can't provide a sample of my efforts, but I really don't know where to start besides:

    mydata <- merge(df1, df2, by=c("ID"), all = TRUE)

To reproduce the data frames (note that the names df1 and df2 here are swapped relative to the printout above):

    df1 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(2, 5, NA, NA, 9),
            world = c(3, 1, NA, NA, 7), football = c(43, 24, 2, 5, 12),
            baseball = c(6, 32, 23, 15, 23)), .Names = c("ID", "hello", "world",
            "football", "baseball"), class = "data.frame", row.names = c(NA, -5L))

    df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(NA, NA, 10, 4, NA),
            world = c(NA, NA, 8, 17, NA), hockey = c(7, 2, 8, 5, 3),
            soccer = c(4, 5, 23, 12, 43)), .Names = c("ID", "hello", "world", "hockey",
            "soccer"), class = "data.frame", row.names = c(NA, -5L))
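For illustration, here is a minimal sketch of where the merge call above leads, using the same values as the reproduction code (the data.frame() calls are just a shorter way of writing that data, and the name `shared` is made up for this example). Because only ID is used as the key, each shared column comes back twice, once with a `.x` suffix and once with a `.y` suffix, which is the duplication the question is about.

```r
# Same values as the reproduction code, written with data.frame()
df1 <- data.frame(ID = 1:5, hello = c(2, 5, NA, NA, 9), world = c(3, 1, NA, NA, 7),
                  football = c(43, 24, 2, 5, 12), baseball = c(6, 32, 23, 15, 23))
df2 <- data.frame(ID = 1:5, hello = c(NA, NA, 10, 4, NA), world = c(NA, NA, 8, 17, NA),
                  hockey = c(7, 2, 8, 5, 3), soccer = c(4, 5, 23, 12, 43))

# Merging on ID alone keeps both copies of every other shared column
mydata <- merge(df1, df2, by = "ID", all = TRUE)
names(mydata)
# "ID" "hello.x" "world.x" "football" "baseball" "hello.y" "world.y" "hockey" "soccer"

# The shared columns can be found programmatically instead of typed out
shared <- setdiff(intersect(names(df1), names(df2)), "ID")
shared
# "hello" "world"
```

With about 20 shared columns, it is the `.x`/`.y` pairs listed by `shared` that still need to be folded back into one column each.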


asked Nov 27 '14 09:11

abclist19

2 Answers

Here's an approach that involves melting your data, merging the molten data, and using dcast to get it back to a wide form. I've added comments to help understand what is going on.

    ## Required packages
    library(data.table)
    library(reshape2)

    dcast.data.table(
      merge(
        ## melt the first data.frame and set the key as ID and variable
        setkey(melt(as.data.table(df1), id.vars = "ID"), ID, variable),
        ## melt the second data.frame
        melt(as.data.table(df2), id.vars = "ID"),
        ## you'll have 2 value columns...
        all = TRUE)[, value := ifelse(
          ## ... combine them into 1 with ifelse
          is.na(value.x), value.y, value.x)],
      ## This is your reshaping formula
      ID ~ variable, value.var = "value")
    #    ID hello world football baseball hockey soccer
    # 1:  1     2     3       43        6      7      4
    # 2:  2     5     1       24       32      2      5
    # 3:  3    10     8        2       23      8     23
    # 4:  4     4    17        5       15      5     12
    # 5:  5     9     7       12       23      3     43
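The same melt, merge, and dcast steps can also be sketched one at a time, which makes the intermediate long table easier to inspect. This is a variant under stated assumptions rather than the answer's exact code: it relies only on data.table's own melt() and dcast() (no reshape2), passes `variable.factor = FALSE` so the join key is a plain character column, and uses the made-up intermediate names `long1`, `long2`, `merged`, and `result`.

```r
library(data.table)

# Same values as the question's reproduction code
df1 <- data.frame(ID = 1:5, hello = c(2, 5, NA, NA, 9), world = c(3, 1, NA, NA, 7),
                  football = c(43, 24, 2, 5, 12), baseball = c(6, 32, 23, 15, 23))
df2 <- data.frame(ID = 1:5, hello = c(NA, NA, 10, 4, NA), world = c(NA, NA, 8, 17, NA),
                  hockey = c(7, 2, 8, 5, 3), soccer = c(4, 5, 23, 12, 43))

# Step 1: melt each table to long form, one row per (ID, variable) pair
long1 <- melt(as.data.table(df1), id.vars = "ID", variable.factor = FALSE)
long2 <- melt(as.data.table(df2), id.vars = "ID", variable.factor = FALSE)

# Step 2: full outer merge on ID and variable; this yields value.x and value.y
merged <- merge(long1, long2, by = c("ID", "variable"), all = TRUE)

# Step 3: take value.x where it is present, otherwise fall back to value.y
merged[, value := ifelse(is.na(value.x), value.y, value.x)]

# Step 4: reshape back to wide form, one column per variable
result <- dcast(merged, ID ~ variable, value.var = "value")
```

Since `variable` is a character column here, the column order of `result` may differ from the output shown above, so it is safer to look columns up by name than by position.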

answered Sep 27 '22 18:09

A5C1D2H2I1M1N2O1R2T1

Nobody's posted a dplyr solution, so here's a succinct option in dplyr. The approach is simply to do a full_join that combines all rows, then group and summarise to remove the redundant missing cells.

    library(tidyverse)

    df1 <- structure(list(ID = 1:5, hello = c(NA, NA, 10L, 4L, NA),
        world = c(NA, NA, 8L, 17L, NA), hockey = c(7L, 2L, 8L, 5L, 3L),
        soccer = c(4L, 5L, 23L, 12L, 43L)), row.names = c(NA, -5L),
        class = c("tbl_df", "tbl", "data.frame"),
        spec = structure(list(cols = list(
            ID = structure(list(), class = c("collector_integer", "collector")),
            hello = structure(list(), class = c("collector_integer", "collector")),
            world = structure(list(), class = c("collector_integer", "collector")),
            hockey = structure(list(), class = c("collector_integer", "collector")),
            soccer = structure(list(), class = c("collector_integer", "collector"))),
            default = structure(list(), class = c("collector_guess", "collector"))),
            class = "col_spec"))

    df2 <- structure(list(ID = 1:5, hello = c(2L, 5L, NA, NA, 9L),
        world = c(3L, 1L, NA, NA, 7L), football = c(43L, 24L, 2L, 5L, 12L),
        baseball = c(6L, 32L, 23L, 15L, 2L)), row.names = c(NA, -5L),
        class = c("tbl_df", "tbl", "data.frame"),
        spec = structure(list(cols = list(
            ID = structure(list(), class = c("collector_integer", "collector")),
            hello = structure(list(), class = c("collector_integer", "collector")),
            world = structure(list(), class = c("collector_integer", "collector")),
            football = structure(list(), class = c("collector_integer", "collector")),
            baseball = structure(list(), class = c("collector_integer", "collector"))),
            default = structure(list(), class = c("collector_guess", "collector"))),
            class = "col_spec"))

    df1 %>%
      full_join(df2, by = intersect(colnames(df1), colnames(df2))) %>%
      group_by(ID) %>%
      summarize_all(na.omit)
    #> # A tibble: 5 x 7
    #>      ID hello world hockey soccer football baseball
    #>   <int> <int> <int>  <int>  <int>    <int>    <int>
    #> 1     1     2     3      7      4       43        6
    #> 2     2     5     1      2      5       24       32
    #> 3     3    10     8      8     23        2       23
    #> 4     4     4    17      5     12        5       15
    #> 5     5     9     7      3     43       12        2

Created on 2018-07-13 by the reprex package (v0.2.0).
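To see why the grouping step is needed, it can help to look at the intermediate join. The sketch below uses plain data frames with the question's values and, as an assumption about the reader's setup, swaps summarize_all() for across(), which needs dplyr 1.0.0 or later. The helper `first_non_na()` and the names `joined` and `result` are made up for illustration.

```r
library(dplyr)

# The question's values, laid out as in this answer (df1 holds hockey/soccer)
df1 <- data.frame(ID = 1:5, hello = c(NA, NA, 10, 4, NA), world = c(NA, NA, 8, 17, NA),
                  hockey = c(7, 2, 8, 5, 3), soccer = c(4, 5, 23, 12, 43))
df2 <- data.frame(ID = 1:5, hello = c(2, 5, NA, NA, 9), world = c(3, 1, NA, NA, 7),
                  football = c(43, 24, 2, 5, 12), baseball = c(6, 32, 23, 15, 23))

# Joining on every shared column (ID, hello, world) means rows only match when
# hello and world agree as well, so each ID ends up on two half-filled rows.
joined <- full_join(df1, df2, by = intersect(colnames(df1), colnames(df2)))
nrow(joined)
# 10

# Hypothetical helper: first non-missing value in a column, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse the two half-filled rows per ID into one complete row
result <- joined %>%
  group_by(ID) %>%
  summarise(across(everything(), first_non_na))
```

This sketch assumes each ID has at most one non-missing value per column; if both data frames held different values for the same cell, only the first one encountered would be kept.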

answered Sep 27 '22 19:09

thc
