Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

merge data frames based on non-identical values in R

Tags:

merge

r

I have two data frames. First one looks like

dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1",   "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ",  "chr15:35086890..35086919", "2")

where chr15:35086890..35086919 indicates all the numbers within this range.

The second looks like:

dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT","     FUNC")
dat2[1,] <- c("chr1:116242719",   "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855",  "A/G", "intergenic")

I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR in more than one range in dat$Pos, I want it merged each time). What's the easiest way to do this?

like image 507
cianius Avatar asked Dec 09 '25 06:12

cianius


1 Answers

Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:

library(plyr)

exploded.dat <- adply(dat, 1, function(x){
    parts <- strsplit(x$Pos, ":")[[1]]
    chr   <- parts[1]
    range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
    start <- range[1]
    end   <- range[2]
    data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})

merge(dat2, exploded.dat, by = "VAR")

If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.

like image 169
flodel Avatar answered Dec 10 '25 22:12

flodel



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!