
Merging two data frames in R that have common and uncommon samples

Tags: merge, r, unique

I have two data frames, Data1 and Data2, that I want to merge based on the variable "ID".

This example data may be downloaded here: http://dl.dropbox.com/u/52600559/example.RData

Here is the first data frame:

> Data1
   ID     Fruit  Color Weight
1   1     Apple    Red      5
2   2    Orange Orange      7
3   3    Banana Yellow      3
4   4      Pear  Green      5
5   5    Tomato    Red      4
6   6     Berry   Blue      4
7   7  Mandarin Orange      4
8   8 Pineapple Yellow      9
9   9 Nectarine Orange      5
10 10      Beet    Red      5

And here is the second data frame:

> Data2
   ID       Fruit  Color Weight
1   1       Apple    Red      5
2   2      Orange Orange      7
3   3      Banana Yellow      3
4   4        Pear  Green      5
5   5      Tomato    Red      4
6  11 Pomegranate    Red      6
7  12       Grape  Green      4
8  13   Cranberry    Red      4
9  14       Melon   Pink      5
10 15     Pumpkin Orange     10

I have tried to merge them like this:

> merge(Data1, Data2, by = "ID", sort = FALSE, all.x = TRUE, all.y = TRUE)
   ID   Fruit.x Color.x Weight.x     Fruit.y Color.y Weight.y
1   1     Apple     Red        5       Apple     Red        5
2   2    Orange  Orange        7      Orange  Orange        7
3   3    Banana  Yellow        3      Banana  Yellow        3
4   4      Pear   Green        5        Pear   Green        5
5   5    Tomato     Red        4      Tomato     Red        4
6   9 Nectarine  Orange        5        <NA>    <NA>       NA
7   6     Berry    Blue        4        <NA>    <NA>       NA
8   7  Mandarin  Orange        4        <NA>    <NA>       NA
9   8 Pineapple  Yellow        9        <NA>    <NA>       NA
10 10      Beet     Red        5        <NA>    <NA>       NA
11 14      <NA>    <NA>       NA       Melon    Pink        5
12 11      <NA>    <NA>       NA Pomegranate     Red        6
13 12      <NA>    <NA>       NA       Grape   Green        4
14 13      <NA>    <NA>       NA   Cranberry     Red        4
15 15      <NA>    <NA>       NA     Pumpkin  Orange       10

As you can see, both data frames have many of the same variables. However, some IDs in Data1 are not in Data2, and vice versa. Moreover, some IDs are located in both data frames.

Question 1: I also want to merge the duplicated columns shown above. For example, I want "Fruit.x" and "Fruit.y" combined into one column called "Fruit". How can I do this?

Question 2: What if, for a sample that is present in both Data1 and Data2, one of the values does not agree? For example, for sample ID 1, Fruit.x is Apple but Fruit.y is incorrectly coded as Aple (a misspelling). Is there a way to check all of these instances quickly so that I can select which one is correct? Or can I tell R to always treat Data1 as correct over Data2 when this happens?

Thanks to anyone who can help!!

Alexander asked Feb 15 '12 16:02


2 Answers

Try this:

merge(Data1, Data2, all = TRUE)

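With no by argument, merge joins on every column the two data frames share (here all four), so rows that are identical in both appear once and no .x/.y columns are created.

For the spelling check, try the loop below. Here amatch holds the approximate matches to fruit found in Data2$Fruit, and near holds those approximate matches that are not exact matches: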

for(fruit in Data1$Fruit) {
    amatch <- agrep(fruit, Data2$Fruit, value = TRUE)
    near <- amatch[amatch != fruit]
    if (length(near) > 0) cat(fruit, ":", near, "\n")
}

Using the data provided this gives:

Berry : Cranberry 

EDIT: improved clarity of code
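
For the second question, one caveat: merge(Data1, Data2, all = TRUE) joins on every shared column, so an ID whose values disagree shows up as two separate rows. Below is a minimal sketch of one way to list such disagreements and then keep Data1's value. It assumes the data frames are rebuilt by hand from the tables in the question, with "Aple" introduced deliberately for ID 1; the names m, conflicts, vars and resolved are only illustrative.

```r
# Rebuilt from the tables in the question (assumption: the real objects
# have the same columns). ID 1 in Data2 is misspelled on purpose.
Data1 <- data.frame(
  ID     = 1:10,
  Fruit  = c("Apple", "Orange", "Banana", "Pear", "Tomato",
             "Berry", "Mandarin", "Pineapple", "Nectarine", "Beet"),
  Color  = c("Red", "Orange", "Yellow", "Green", "Red",
             "Blue", "Orange", "Yellow", "Orange", "Red"),
  Weight = c(5, 7, 3, 5, 4, 4, 4, 9, 5, 5),
  stringsAsFactors = FALSE
)
Data2 <- data.frame(
  ID     = c(1:5, 11:15),
  Fruit  = c("Aple", "Orange", "Banana", "Pear", "Tomato",
             "Pomegranate", "Grape", "Cranberry", "Melon", "Pumpkin"),
  Color  = c("Red", "Orange", "Yellow", "Green", "Red",
             "Red", "Green", "Red", "Pink", "Orange"),
  Weight = c(5, 7, 3, 5, 4, 6, 4, 4, 5, 10),
  stringsAsFactors = FALSE
)

# Merge on ID only, so disagreeing values sit side by side in .x/.y columns.
m <- merge(Data1, Data2, by = "ID", all = TRUE)

# Rows present in both frames whose Fruit values disagree.
conflicts <- m[!is.na(m$Fruit.x) & !is.na(m$Fruit.y) &
               m$Fruit.x != m$Fruit.y, ]
print(conflicts[, c("ID", "Fruit.x", "Fruit.y")])

# Resolve every column in favour of Data1, falling back to Data2 only
# where Data1 has no value for that ID.
vars <- setdiff(names(Data1), "ID")
resolved <- data.frame(ID = m$ID, stringsAsFactors = FALSE)
for (v in vars) {
  x <- m[[paste0(v, ".x")]]
  y <- m[[paste0(v, ".y")]]
  resolved[[v]] <- ifelse(is.na(x), y, x)
}
print(resolved)
```

The same comparison can be repeated for Color and Weight. If the conflicts never need to be inspected, rbind(Data1, Data2[!Data2$ID %in% Data1$ID, ]) yields the same rows for this data.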

G. Grothendieck answered Sep 21 '22 01:09


To answer question 1:

merge(Data1, Data2, all = TRUE)

should give you what you're looking for. It won't deal with misspellings, though; you would have to handle those separately. unique is a good tool for finding them, as is tolower for normalizing capitalization.
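
As a rough illustration of the unique and tolower idea, here is a sketch on a small made-up data frame (not the question's data; the names both, n_spellings and suspect_ids are hypothetical). It counts the distinct lower-cased spellings per ID and reports the IDs that still have more than one.

```r
# Toy data: ID 1 has a real typo, ID 2 differs only in capitalization,
# ID 3 agrees exactly.
both <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 3),
  Fruit = c("Apple", "Aple", "Orange", "orange", "Banana", "Banana"),
  stringsAsFactors = FALSE
)

# Number of distinct spellings per ID once case is ignored.
n_spellings <- tapply(tolower(both$Fruit), both$ID,
                      function(x) length(unique(x)))

# IDs whose spellings still differ after lower-casing.
suspect_ids <- names(n_spellings)[n_spellings > 1]
print(both[as.character(both$ID) %in% suspect_ids, ])
```

Only ID 1 is reported here, because lower-casing already reconciles "Orange" and "orange".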

Justin answered Sep 20 '22 01:09