<p>This is my first time posting here, so please be kind ;-)</p> <p><strong>EDIT</strong> My question was closed before I had a chance to make the changes suggested to me. So I'm trying to do a better job now, thanks for everyone that answered so far!</p> <h3>QUESTION</h3> <p>How can I identify records/rows in data frame <code>x.1</code> that are <em>not</em> contained in data frame <code>x.2</code> based on <em>all</em> attributes available (i.e. all columns) in the <em>most efficient</em> way?</p> <h3><strong>EXAMPLE DATA</strong></h3> <pre class="prettyprint"><code>> x.1 <- data.frame(a=c(1,2,3,4,5), b=c(1,2,3,4,5)) > x.1 a b 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 > x.2 <- data.frame(a=c(1,1,2,3,4), b=c(1,1,99,3,4)) > x.2 a b 1 1 1 2 1 1 3 2 99 4 3 3 5 4 4 </code></pre> <h3>DESIRED RESULT</h3> <pre class="prettyprint"><code> a b 2 2 2 5 5 5 </code></pre> <h3>BEST SOLUTION SO FAR</h3> <p>by Prof. Brian Ripley and Gabor Grothendieck</p> <pre class="prettyprint"><code>> fun.12 <- function(x.1,x.2,...){ + x.1p <- do.call("paste", x.1) + x.2p <- do.call("paste", x.2) + x.1[! x.1p %in% x.2p, ] + } > fun.12(x.1,x.2) a b 2 2 2 5 5 5 > sol.12 <- microbenchmark(fun.12(x.1,x.2)) > sol.12 <- median(sol.12$time)/1000000000 > sol.12 > [1] 0.000207784 </code></pre> <p><strong>A collection of all solutions tested so far is available at my blog</strong></p> <h3>FINAL EDIT 2011-10-14</h3> <p>Here's the best solution wrapped into a function 'mergeX()':</p> <pre class="prettyprint"><code>setGeneric( name="mergeX", signature=c("src.1", "src.2"), def=function( src.1, src.2, ... ){ standardGeneric("mergeX") } ) setMethod( f="mergeX", signature=signature(src.1="data.frame", src.2="data.frame"), definition=function( src.1, src.2, do.inverse=FALSE, ... ){ if(!do.inverse){ out <- merge(x=src.1, y=src.2, ...) } else { if("by.y" %in% names(list(...))){ src.2.0 <- src.2 src.2 <- src.1 src.1 <- src.2.0 } src.1p <- do.call("paste", src.1) src.2p <- do.call("paste", src.2) out <- src.1[! src.1p %in% src.2p, ] } return(out) } ) </code></pre>

<p>What about using <code>merge</code> - the simplest possible solution - I'd think it's also the fastest.</p> <pre class="prettyprint"><code>tmp = merge(x.1, cbind(x.2, myid = 1:nrow(x.2)), all.x = TRUE) # provided that there's no column myid in both dataframes tmp[is.na(tmp$myid), 1:ncol(x.1)] # the result </code></pre> <p>Corresponds to:</p> <pre class="prettyprint"><code>select x1.* from x1 natural left join x2 where x2.myid is NULL </code></pre> <p>(you can also use sqldf to do that).</p> <p>Note that the column <code>myid</code> is added to assure that there is some column w/o NA values. If you are sure there is already some column which doesn't contain NULL values, you can simplify the solution:</p> <pre class="prettyprint"><code>tmp = merge(x.1, x.2, all.x = TRUE) # provided that there's no column myid in both dataframes tmp[is.na(tmp$some_column), 1:ncol(x.1)] # the result </code></pre>

Identify records in data frame A not contained in data frame B [closed]

QUESTION

How can I identify records/rows in data frame x.1 that are not contained in data frame x.2 based on all attributes available (i.e. all columns) in the most efficient way?

EXAMPLE DATA

> x.1 <- data.frame(a=c(1,2,3,4,5), b=c(1,2,3,4,5))
> x.1
  a b
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5

> x.2 <- data.frame(a=c(1,1,2,3,4), b=c(1,1,99,3,4))
> x.2
  a  b
1 1  1
2 1  1
3 2 99
4 3  3
5 4  4

DESIRED RESULT

  a b
2 2 2
5 5 5

BEST SOLUTION SO FAR

by Prof. Brian Ripley and Gabor Grothendieck

> fun.12 <- function(x.1,x.2,...){
+     x.1p <- do.call("paste", x.1)
+     x.2p <- do.call("paste", x.2)
+     x.1[! x.1p %in% x.2p, ]
+ }
> fun.12(x.1,x.2)
  a b
2 2 2
5 5 5
> sol.12 <- microbenchmark(fun.12(x.1,x.2))
> sol.12 <- median(sol.12$time)/1000000000
> sol.12
> [1] 0.000207784

A collection of all solutions tested so far is available at my blog

FINAL EDIT 2011-10-14

Here's the best solution wrapped into a function 'mergeX()':

setGeneric(
    name="mergeX",
    signature=c("src.1", "src.2"),
    def=function(
        src.1,
        src.2,
        ...
    ){
    standardGeneric("mergeX")    
    }
)

setMethod(
    f="mergeX", 
    signature=signature(src.1="data.frame", src.2="data.frame"), 
    definition=function(
        src.1,
        src.2,
        do.inverse=FALSE,
        ...
    ){
    if(!do.inverse){
        out <- merge(x=src.1, y=src.2, ...)
    } else {
        if("by.y" %in% names(list(...))){
            src.2.0 <- src.2
            src.2 <- src.1
            src.1 <- src.2.0
        }
        src.1p <- do.call("paste", src.1)
        src.2p <- do.call("paste", src.2)
        out <- src.1[! src.1p %in% src.2p, ]
    }
    return(out)    
    }
)

904

asked Oct 11 '11 15:10

Rappster

2 Answers

Here are a few ways. #1 and #4 assume that the rows of x.1 are unique. (If rows of x.1 are not unique then they will return only one of the duplicates among the duplicated rows.) The others return all duplicates:

# 1
x.1[!duplicated(rbind(x.2, x.1))[-(1:nrow(x.2))],]

# 2
do.call("rbind", setdiff(split(x.1, rownames(x.1)), split(x.2, rownames(x.2))))

# 3
x.1p <- do.call("paste", x.1)
x.2p <- do.call("paste", x.2)
x.1[! x.1p %in% x.2p, ]

# 4
library(sqldf)
sqldf("select * from `x.1` except select * from `x.2`")

EDIT: x.1 and x.2 were swapped and this has been fixed. Also have corrected note on limitations at the beginning.

answered Nov 03 '22 05:11

G. Grothendieck

What about using merge - the simplest possible solution - I'd think it's also the fastest.

tmp = merge(x.1, cbind(x.2, myid = 1:nrow(x.2)), all.x = TRUE)
    # provided that there's no column myid in both dataframes
tmp[is.na(tmp$myid), 1:ncol(x.1)] # the result

Corresponds to:

select x1.* 
from x1 natural left join x2 
where x2.myid is NULL

(you can also use sqldf to do that).

Note that the column myid is added to assure that there is some column w/o NA values. If you are sure there is already some column which doesn't contain NULL values, you can simplify the solution:

tmp = merge(x.1, x.2, all.x = TRUE)
    # provided that there's no column myid in both dataframes
tmp[is.na(tmp$some_column), 1:ncol(x.1)] # the result

answered Nov 03 '22 07:11

Tomas

Related questions
                            
                                Specifying column with its index rather than name
                            
                                Width of error bars in line plot using ggplot2
                            
                                Unicode normalization (form C) in R : convert all characters with accents into their one-unicode-character form?
                            
                                Combine two lists of dataframes, dataframe by dataframe
                            
                                RDA, Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric, when data is numeric?
                            
                                What affects the time to create a cluster using the parallel package?
                            
                                Complete partially filled in columns, based on established relationships between columns
                            
                                Reload .Renviron or .Rprofile from an active R session (without restarting R)?
                            
                                Remove border lines in ggplot map/choropleth
                            
                                What is the purpose of .*\\?
                            
                                How to configure R-3.1.2 with --enable-R-shlib
                            
                                How do I sweep specific columns with dplyr?
                            
                                Where do absent dots (`...`) get processed?
                            
                                Prevent zooming out in leaflet R-Map?
                            
                                Equivalent of rowwise() do() with purrr, now that by_row() is deprecated?
                            
                                Error in bind_rows_(x, .id) : Argument 1 must have names using map_df in purrr
                            
                                Plotly plot doesn't render within for loop of RMarkdown document
                            
                                3D scatterplot using custom image
                            
                                Minimise cost of reallocating individuals
                            
                                Using MySQL in R for Windows

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Identify records in data frame A not contained in data frame B [closed]

Tags:

merge

join

dataframe

r

match