I have two data frames, Data1 and Data2, that I want to merge based on the variable "ID".
This example data may be downloaded here: http://dl.dropbox.com/u/52600559/example.RData
Here is the first data frame:
> Data1
ID Fruit Color Weight
1 1 Apple Red 5
2 2 Orange Orange 7
3 3 Banana Yellow 3
4 4 Pear Green 5
5 5 Tomato Red 4
6 6 Berry Blue 4
7 7 Mandarin Orange 4
8 8 Pineapple Yellow 9
9 9 Nectarine Orange 5
10 10 Beet Red 5
And here is the second data frame:
> Data2
ID Fruit Color Weight
1 1 Apple Red 5
2 2 Orange Orange 7
3 3 Banana Yellow 3
4 4 Pear Green 5
5 5 Tomato Red 4
6 11 Pomegranate Red 6
7 12 Grape Green 4
8 13 Cranberry Red 4
9 14 Melon Pink 5
10 15 Pumpkin Orange 10
I have tried to merge them like this:
> merge(Data1, Data2, by = "ID", sort = FALSE, all.x = TRUE, all.y = TRUE)
ID Fruit.x Color.x Weight.x Fruit.y Color.y Weight.y
1 1 Apple Red 5 Apple Red 5
2 2 Orange Orange 7 Orange Orange 7
3 3 Banana Yellow 3 Banana Yellow 3
4 4 Pear Green 5 Pear Green 5
5 5 Tomato Red 4 Tomato Red 4
6 9 Nectarine Orange 5 <NA> <NA> NA
7 6 Berry Blue 4 <NA> <NA> NA
8 7 Mandarin Orange 4 <NA> <NA> NA
9 8 Pineapple Yellow 9 <NA> <NA> NA
10 10 Beet Red 5 <NA> <NA> NA
11 14 <NA> <NA> NA Melon Pink 5
12 11 <NA> <NA> NA Pomegranate Red 6
13 12 <NA> <NA> NA Grape Green 4
14 13 <NA> <NA> NA Cranberry Red 4
15 15 <NA> <NA> NA Pumpkin Orange 10
As you can see, both data frames have many of the same variables. However, some IDs in Data1 are not in Data2, and vice versa. Moreover, some IDs are located in both data frames.
Question 1: I also want to merge the duplicated columns shown above. For example, I want "Fruit.x" and "Fruit.y" to be merged into one column called "Fruit". How can I do this?
Question 2: What if, for one of the samples that is present in both Data1 and Data2, one of the values does not agree? For example, for sample ID 1, Fruit.x is Apple but Fruit.y is incorrectly coded as Aple (a misspelling). Is there a way to check all of these instances quickly so that I can select which one is correct? Or can I tell R to always treat Data1 as correct over Data2 when this happens?
Thanks to anyone who can help!!
Try this:
merge(Data1, Data2, all = TRUE)
and for spellings try this, where amatch holds the approximate matches to fruit and near contains the approximate matches that do not match exactly:
for (fruit in Data1$Fruit) {
  amatch <- agrep(fruit, Data2$Fruit, value = TRUE)
  near <- amatch[amatch != fruit]
  if (length(near) > 0) cat(fruit, ":", near, "\n")
}
Using the data provided this gives:
Berry : Cranberry
EDIT: improved clarity of code
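For Question 2, here is a minimal sketch of one way to flag disagreeing records and to prefer Data1. It uses small made-up stand-ins for Data1 and Data2 (not the data from the question), and the names merged, conflict_ids and combined are only illustrative. It relies on the fact that merge() without a by argument joins on all shared column names, so a record whose values differ between the two data frames appears twice under the same ID.

```r
# Small hypothetical stand-ins for Data1 and Data2 (not the original data).
Data1 <- data.frame(ID = c(1, 2, 3),
                    Fruit = c("Apple", "Orange", "Banana"),
                    stringsAsFactors = FALSE)
Data2 <- data.frame(ID = c(1, 2, 4),
                    Fruit = c("Aple", "Orange", "Grape"),
                    stringsAsFactors = FALSE)

# With no `by`, merge() joins on every shared column (ID and Fruit here),
# so rows that disagree are kept as separate rows with the same ID.
merged <- merge(Data1, Data2, all = TRUE)

# IDs that occur more than once are the records to review by hand.
conflict_ids <- unique(merged$ID[duplicated(merged$ID)])
print(merged[merged$ID %in% conflict_ids, ])

# To always trust Data1 instead: keep every Data1 row and append only
# the Data2 rows whose ID does not occur in Data1.
combined <- rbind(Data1, Data2[!Data2$ID %in% Data1$ID, ])
print(combined)
```

In this toy example only ID 1 is flagged, and combined keeps the "Apple" spelling from Data1. Whether Data1 should really win is a judgment call about your data, so it may be worth inspecting the flagged rows first.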
To answer question 1:
merge(Data1, Data2, all = TRUE)
should give you what you're looking for. It won't deal with misspellings, though; you would have to handle those separately. unique is a good tool for finding them, as is tolower for normalizing capitalization.
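As a rough illustration of the unique and tolower idea, here is a sketch on a made-up vector (the values and the names fruit and normalized are hypothetical, not taken from the question's data):

```r
# Hypothetical values, for illustration only.
fruit <- c("Apple", "apple", "Aple", "Orange", "ORANGE")

# tolower() removes differences that are only a matter of capitalization.
normalized <- tolower(fruit)

# unique() lists the distinct spellings that remain, so a typo such as
# "aple" stands out when the result is read by eye.
print(sort(unique(normalized)))
```

This only surfaces the candidates; deciding that "aple" should become "apple" still has to be done by hand or with an approximate matcher.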