Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Identify records in data frame A not contained in data frame B [closed]

This is my first time posting here, so please be kind ;-)

EDIT My question was closed before I had a chance to make the changes suggested to me. So I'm trying to do a better job now, thanks for everyone that answered so far!

QUESTION

How can I identify records/rows in data frame x.1 that are not contained in data frame x.2 based on all attributes available (i.e. all columns) in the most efficient way?

EXAMPLE DATA

> x.1 <- data.frame(a=c(1,2,3,4,5), b=c(1,2,3,4,5))
> x.1
  a b
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5

> x.2 <- data.frame(a=c(1,1,2,3,4), b=c(1,1,99,3,4))
> x.2
  a  b
1 1  1
2 1  1
3 2 99
4 3  3
5 4  4

DESIRED RESULT

  a b
2 2 2
5 5 5

BEST SOLUTION SO FAR

by Prof. Brian Ripley and Gabor Grothendieck

> fun.12 <- function(x.1,x.2,...){
+     x.1p <- do.call("paste", x.1)
+     x.2p <- do.call("paste", x.2)
+     x.1[! x.1p %in% x.2p, ]
+ }
> fun.12(x.1,x.2)
  a b
2 2 2
5 5 5
> sol.12 <- microbenchmark(fun.12(x.1,x.2))
> sol.12 <- median(sol.12$time)/1000000000
> sol.12
> [1] 0.000207784

A collection of all solutions tested so far is available at my blog

FINAL EDIT 2011-10-14

Here's the best solution wrapped into a function 'mergeX()':

setGeneric(
    name="mergeX",
    signature=c("src.1", "src.2"),
    def=function(
        src.1,
        src.2,
        ...
    ){
    standardGeneric("mergeX")    
    }
)

setMethod(
    f="mergeX", 
    signature=signature(src.1="data.frame", src.2="data.frame"), 
    definition=function(
        src.1,
        src.2,
        do.inverse=FALSE,
        ...
    ){
    if(!do.inverse){
        out <- merge(x=src.1, y=src.2, ...)
    } else {
        if("by.y" %in% names(list(...))){
            src.2.0 <- src.2
            src.2 <- src.1
            src.1 <- src.2.0
        }
        src.1p <- do.call("paste", src.1)
        src.2p <- do.call("paste", src.2)
        out <- src.1[! src.1p %in% src.2p, ]
    }
    return(out)    
    }
)
like image 904
Rappster Avatar asked Oct 11 '11 15:10

Rappster


People also ask

How do you find rows from one DataFrame is not in another?

Method 1: Using sqldf() Our query will be sqldf('SELECT * FROM df1 EXCEPT SELECT * FROM df2'). It will exclude all the rows from df1 that are also present in df2 and will return only rows that are only present in df1.

What data type is a data frame?

Pandas DataFrame is a Two-dimensional data structure of mutable size and heterogeneous tabular data. There are different Built-in data types available in Python.

How do you check if a variable is a data frame or not?

frame() function in R Language is used to return TRUE if the specified data type is a data frame else return FALSE.

How do you show all records in a data frame?

Set value of display. max_rows to None and pass it to set_option and this will display all rows from the data frame.


2 Answers

Here are a few ways. #1 and #4 assume that the rows of x.1 are unique. (If rows of x.1 are not unique then they will return only one of the duplicates among the duplicated rows.) The others return all duplicates:

# 1
x.1[!duplicated(rbind(x.2, x.1))[-(1:nrow(x.2))],]

# 2
do.call("rbind", setdiff(split(x.1, rownames(x.1)), split(x.2, rownames(x.2))))

# 3
x.1p <- do.call("paste", x.1)
x.2p <- do.call("paste", x.2)
x.1[! x.1p %in% x.2p, ]

# 4
library(sqldf)
sqldf("select * from `x.1` except select * from `x.2`")

EDIT: x.1 and x.2 were swapped and this has been fixed. Also have corrected note on limitations at the beginning.

like image 72
G. Grothendieck Avatar answered Nov 03 '22 05:11

G. Grothendieck


What about using merge - the simplest possible solution - I'd think it's also the fastest.

tmp = merge(x.1, cbind(x.2, myid = 1:nrow(x.2)), all.x = TRUE)
    # provided that there's no column myid in both dataframes
tmp[is.na(tmp$myid), 1:ncol(x.1)] # the result

Corresponds to:

select x1.* 
from x1 natural left join x2 
where x2.myid is NULL

(you can also use sqldf to do that).

Note that the column myid is added to assure that there is some column w/o NA values. If you are sure there is already some column which doesn't contain NULL values, you can simplify the solution:

tmp = merge(x.1, x.2, all.x = TRUE)
    # provided that there's no column myid in both dataframes
tmp[is.na(tmp$some_column), 1:ncol(x.1)] # the result
like image 45
Tomas Avatar answered Nov 03 '22 07:11

Tomas