I am trying to deal with some very messy data. I need to merge two large data frames which contain different kinds of data by the sample ID. The problem is that one table's sample IDs are in many different formats, but most contain the required ID string for matching somewhere in their ID, e.g. sample "1234" in one table has got an ID of "ProjectB(1234)" in the other.
I have made a minimal reproducible example.
a<-data.frame(aID=c("1234","4567","6789","3645"),aInfo=c("blue","green","goldenrod","cerulean"))
b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))
using merge gets part of the way:
merge(a,b, by.x="aID", by.y="bID", all=TRUE)
aID aInfo bInfo
1 1234 blue <NA>
2 3645 cerulean <NA>
3 4567 green apple
4 6789 goldenrod kiwi
5 (1234) <NA> banana
6 23645 <NA> pomegranate
7 63528973 <NA> lychee
but the output that would be liked is basically:
ID aInfo bInfo
1 1234 blue banana
2 3645 cerulean pomegranate
3 4567 green apple
4 6789 goldenrod kiwi
5 63528973 <NA> lychee
I just wondered if there was a way to incorporate grep into this or another R-tastic method?
Thanks in advance
In R we use merge() function to merge two dataframes in R. This function is present inside join() function of dplyr package. The most important condition for joining two dataframes is that the column type should be the same on which the merging happens. merge() function works similarly like join in DBMS.
The merge() function in R combines two data frames. The most crucial requirement for connecting two data frames is that the column type is the same on which the merging occurs. The merge() function is similar to the join function in a Relational Database Management System (RDMS).
One base R way to do this is with the merge() function, using the basic syntax merge(df1, df2) . The order of data frame 1 and data frame 2 doesn't matter, but whichever one is first is considered x and the second one is y.
Doing merge
on a condition is a little tricky. I don't think you can do it with merge
as it is written, so you end up having to write a custom function with by
. It is pretty inefficient, but then, so is merge
. If you have millions of rows, consider data.table
. This is how you would do a "inner join" where only rows that match are returned.
# I slightly modified your data to test multiple matches
a<-data.frame(aID=c("1234","1234","4567","6789","3645"),aInfo=c("blue","blue2","green","goldenrod","cerulean"))
b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))
f<-function(x) merge(x,b[agrep(x$aID[1],b$bID),],all=TRUE)
do.call(rbind,by(a,a$aID,f))
# aID aInfo bID bInfo
# 1234.1 1234 blue (1234) banana
# 1234.2 1234 blue2 (1234) banana
# 3645 3645 cerulean 23645 pomegranate
# 4567 4567 green 4567 apple
# 6789 6789 goldenrod 6789 kiwi
Doing a full join is a little trickier. This is one way, that is still inefficient:
f<-function(x,b) {
matches<-b[agrep(x[1,1],b[,1]),]
if (nrow(matches)>0) merge(x,matches,all=TRUE)
# Ugly... but how else to create a data.frame full of NAs?
else merge(x,b[NA,][1,],all.x=TRUE)
}
d<-do.call(rbind,by(a,a$aID,f,b))
left.over<-!(b$bID %in% d$bID)
rbind(d,do.call(rbind,by(b[left.over,],'bID',f,a))[names(d)])
# aID aInfo bID bInfo
# 1234.1 1234 blue (1234) banana
# 1234.2 1234 blue2 (1234) banana
# 3645 3645 cerulean 23645 pomegranate
# 4567 4567 green 4567 apple
# 6789 6789 goldenrod 6789 kiwi
# bID <NA> <NA> 63528973 lychee
This is an answer using data.table
, inspired by @nograpes.
## Create example tables; I added the sarcoline cases
## so there would be examples of rows in a but not b
a <- data.table(aID=c("1234","1234","4567","6789","3645","321", "321"),
aInfo=c("blue","blue2","green","goldenrod","cerulean",
"sarcoline","sarcoline2"),
key="aID")
b <- data.table(bID=c("4567","(1234)","6789","23645","63528973"),
bInfo=c("apple","banana","kiwi","pomegranate","lychee"),
key="bID")
## Use agrep to get the rows of b by each aID from a
ab <- a[, b[agrep(aID, bID)], by=.(aID, aInfo)]
ab
## aID aInfo bID bInfo
## 1: 1234 blue (1234) banana
## 2: 1234 blue2 (1234) banana
## 3: 3645 cerulean 23645 pomegranate
## 4: 4567 green 4567 apple
## 5: 6789 goldenrod 6789 kiwi
So far we've only got an inner join, so now let's add the unmatched rows from the original tables:
ab <- rbindlist(list(ab, a[!ab[, unique(aID)]], b[!ab[, unique(bID)]]), fill=TRUE)
These steps are optional and are included to match the output from the OP:
## Update NA values of aID with the value from bID
ab[is.na(aID), aID:=bID]
## Drop the bID column
ab[, bID:=NULL]
Final result
ab
## aID aInfo bInfo
## 1: 1234 blue banana
## 2: 1234 blue2 banana
## 3: 3645 cerulean pomegranate
## 4: 4567 green apple
## 5: 6789 goldenrod kiwi
## 6: 321 sarcoline NA
## 7: 321 sarcoline2 NA
## 8: 63528973 NA lychee
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With