I am trying to deal with some very messy data. I need to merge two large data frames which contain different kinds of data by the sample ID. The problem is that one table's sample IDs are in many different formats, but most contain the required ID string for matching somewhere in their ID, e.g. sample "1234" in one table has got an ID of "ProjectB(1234)" in the other. I have made a minimal reproducible example. <pre class="prettyprint"><code>a<-data.frame(aID=c("1234","4567","6789","3645"),aInfo=c("blue","green","goldenrod","cerulean")) b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee")) </code></pre> using merge gets part of the way: <pre class="prettyprint"><code>merge(a,b, by.x="aID", by.y="bID", all=TRUE) aID aInfo bInfo 1 1234 blue <NA> 2 3645 cerulean <NA> 3 4567 green apple 4 6789 goldenrod kiwi 5 (1234) <NA> banana 6 23645 <NA> pomegranate 7 63528973 <NA> lychee </code></pre> but the output that would be liked is basically: <pre class="prettyprint"><code> ID aInfo bInfo 1 1234 blue banana 2 3645 cerulean pomegranate 3 4567 green apple 4 6789 goldenrod kiwi 5 63528973 <NA> lychee </code></pre> I just wondered if there was a way to incorporate grep into this or another R-tastic method? Thanks in advance

This is an answer using <code>data.table</code>, inspired by @nograpes. <pre class="prettyprint"><code>## Create example tables; I added the sarcoline cases ## so there would be examples of rows in a but not b a <- data.table(aID=c("1234","1234","4567","6789","3645","321", "321"), aInfo=c("blue","blue2","green","goldenrod","cerulean", "sarcoline","sarcoline2"), key="aID") b <- data.table(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"), key="bID") ## Use agrep to get the rows of b by each aID from a ab <- a[, b[agrep(aID, bID)], by=.(aID, aInfo)] ab ## aID aInfo bID bInfo ## 1: 1234 blue (1234) banana ## 2: 1234 blue2 (1234) banana ## 3: 3645 cerulean 23645 pomegranate ## 4: 4567 green 4567 apple ## 5: 6789 goldenrod 6789 kiwi </code></pre> So far we've only got an inner join, so now let's add the unmatched rows from the original tables: <pre class="prettyprint"><code>ab <- rbindlist(list(ab, a[!ab[, unique(aID)]], b[!ab[, unique(bID)]]), fill=TRUE) </code></pre> These steps are optional and are included to match the output from the OP: <pre class="prettyprint"><code>## Update NA values of aID with the value from bID ab[is.na(aID), aID:=bID] ## Drop the bID column ab[, bID:=NULL] </code></pre> Final result <pre class="prettyprint"><code>ab ## aID aInfo bInfo ## 1: 1234 blue banana ## 2: 1234 blue2 banana ## 3: 3645 cerulean pomegranate ## 4: 4567 green apple ## 5: 6789 goldenrod kiwi ## 6: 321 sarcoline NA ## 7: 321 sarcoline2 NA ## 8: 63528973 NA lychee </code></pre>

R merge data frames, allow inexact ID matching (e.g. with additional characters 1234 matches ab1234 )

Tags:

regex

merge

dataframe

r

I am trying to deal with some very messy data. I need to merge two large data frames which contain different kinds of data by the sample ID. The problem is that one table's sample IDs are in many different formats, but most contain the required ID string for matching somewhere in their ID, e.g. sample "1234" in one table has got an ID of "ProjectB(1234)" in the other.

I have made a minimal reproducible example.

a<-data.frame(aID=c("1234","4567","6789","3645"),aInfo=c("blue","green","goldenrod","cerulean"))
b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))

using merge gets part of the way:

merge(a,b, by.x="aID", by.y="bID", all=TRUE)

       aID     aInfo       bInfo
1     1234      blue        <NA>
2     3645  cerulean        <NA>
3     4567     green       apple
4     6789 goldenrod        kiwi
5   (1234)      <NA>      banana
6    23645      <NA> pomegranate
7 63528973      <NA>      lychee

but the output that would be liked is basically:

        ID     aInfo       bInfo
1     1234      blue      banana
2     3645  cerulean pomegranate
3     4567     green       apple
4     6789 goldenrod        kiwi
5 63528973      <NA>      lychee

I just wondered if there was a way to incorporate grep into this or another R-tastic method?

Thanks in advance

870

asked Jan 16 '14 14:01

bioinformagician

2 Answers

Doing merge on a condition is a little tricky. I don't think you can do it with merge as it is written, so you end up having to write a custom function with by. It is pretty inefficient, but then, so is merge. If you have millions of rows, consider data.table. This is how you would do a "inner join" where only rows that match are returned.

# I slightly modified your data to test multiple matches    
a<-data.frame(aID=c("1234","1234","4567","6789","3645"),aInfo=c("blue","blue2","green","goldenrod","cerulean"))
b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))

f<-function(x) merge(x,b[agrep(x$aID[1],b$bID),],all=TRUE)
do.call(rbind,by(a,a$aID,f))

#         aID     aInfo    bID       bInfo
# 1234.1 1234      blue (1234)      banana
# 1234.2 1234     blue2 (1234)      banana
# 3645   3645  cerulean  23645 pomegranate
# 4567   4567     green   4567       apple
# 6789   6789 goldenrod   6789        kiwi

Doing a full join is a little trickier. This is one way, that is still inefficient:

f<-function(x,b) {
  matches<-b[agrep(x[1,1],b[,1]),]
  if (nrow(matches)>0) merge(x,matches,all=TRUE)
  # Ugly... but how else to create a data.frame full of NAs?
  else merge(x,b[NA,][1,],all.x=TRUE)
}
d<-do.call(rbind,by(a,a$aID,f,b))
left.over<-!(b$bID %in% d$bID)
rbind(d,do.call(rbind,by(b[left.over,],'bID',f,a))[names(d)])

#         aID     aInfo      bID       bInfo
# 1234.1 1234      blue   (1234)      banana
# 1234.2 1234     blue2   (1234)      banana
# 3645   3645  cerulean    23645 pomegranate
# 4567   4567     green     4567       apple
# 6789   6789 goldenrod     6789        kiwi
# bID    <NA>      <NA> 63528973      lychee

197

answered Oct 18 '22 04:10

nograpes

This is an answer using data.table, inspired by @nograpes.

## Create example tables; I added the sarcoline cases
##   so there would be examples of rows in a but not b
a <- data.table(aID=c("1234","1234","4567","6789","3645","321", "321"),
                aInfo=c("blue","blue2","green","goldenrod","cerulean",
                        "sarcoline","sarcoline2"),
                key="aID")
b <- data.table(bID=c("4567","(1234)","6789","23645","63528973"),
                bInfo=c("apple","banana","kiwi","pomegranate","lychee"),
                key="bID")

## Use agrep to get the rows of b by each aID from a
ab <- a[, b[agrep(aID, bID)], by=.(aID, aInfo)]
ab
##     aID     aInfo    bID       bInfo
## 1: 1234      blue (1234)      banana
## 2: 1234     blue2 (1234)      banana
## 3: 3645  cerulean  23645 pomegranate
## 4: 4567     green   4567       apple
## 5: 6789 goldenrod   6789        kiwi

So far we've only got an inner join, so now let's add the unmatched rows from the original tables:

ab <- rbindlist(list(ab, a[!ab[, unique(aID)]], b[!ab[, unique(bID)]]), fill=TRUE)

These steps are optional and are included to match the output from the OP:

## Update NA values of aID with the value from bID
ab[is.na(aID), aID:=bID]

## Drop the bID column
ab[, bID:=NULL]

Final result

ab
##         aID      aInfo       bInfo
## 1:     1234       blue      banana
## 2:     1234      blue2      banana
## 3:     3645   cerulean pomegranate
## 4:     4567      green       apple
## 5:     6789  goldenrod        kiwi
## 6:      321  sarcoline          NA
## 7:      321 sarcoline2          NA
## 8: 63528973         NA      lychee

answered Oct 18 '22 04:10

dnlbrky

Related questions
                            
                                parallel coord plot ggplot2 deprecated?
                            
                                Read only data slot from shapefile (R)?
                            
                                How to convert a wave file into numbers in R Studio
                            
                                how to change line length on Rterm.exe
                            
                                Accessing columns in data.table using a character vector of column names
                            
                                Log of install.packages()
                            
                                ggplot aes_string with interaction
                            
                                Error in getOctD(x, offset, len) : invalid octal digit
                            
                                ddply to multiple columns equivalent in data.table
                            
                                How to extract R code from a Markdown file (*.md)?
                            
                                What is the most elegant way to calculate seasonal means with R?
                            
                                How can I load a server's specific R installation (environment module) when launching a local installation of emacs?
                            
                                R Shiny, Remove Within-Column Filters from DataTables
                            
                                How to only change parameters for "lower" plots in the ggpairs function from GGally package
                            
                                Using different scales as fill based on factor
                            
                                Display symbols in browser via shiny package R
                            
                                How to get more than one background color on ggplot2 plot area?
                            
                                Expand top of scale/axis to include text
                            
                                Can markdown expressions and results be interleaved in the same block?
                            
                                store and retrieve matrices in memory using xptr

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With