This is my first time posting here, so please be kind ;-)
EDIT My question was closed before I had a chance to make the changes suggested to me. So I'm trying to do a better job now, thanks for everyone that answered so far!
How can I identify records/rows in data frame x.1
that are not contained in data frame x.2
based on all attributes available (i.e. all columns) in the most efficient way?
> x.1 <- data.frame(a=c(1,2,3,4,5), b=c(1,2,3,4,5))
> x.1
a b
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
> x.2 <- data.frame(a=c(1,1,2,3,4), b=c(1,1,99,3,4))
> x.2
a b
1 1 1
2 1 1
3 2 99
4 3 3
5 4 4
a b
2 2 2
5 5 5
by Prof. Brian Ripley and Gabor Grothendieck
> fun.12 <- function(x.1,x.2,...){
+ x.1p <- do.call("paste", x.1)
+ x.2p <- do.call("paste", x.2)
+ x.1[! x.1p %in% x.2p, ]
+ }
> fun.12(x.1,x.2)
a b
2 2 2
5 5 5
> sol.12 <- microbenchmark(fun.12(x.1,x.2))
> sol.12 <- median(sol.12$time)/1000000000
> sol.12
> [1] 0.000207784
A collection of all solutions tested so far is available at my blog
Here's the best solution wrapped into a function 'mergeX()':
setGeneric(
name="mergeX",
signature=c("src.1", "src.2"),
def=function(
src.1,
src.2,
...
){
standardGeneric("mergeX")
}
)
setMethod(
f="mergeX",
signature=signature(src.1="data.frame", src.2="data.frame"),
definition=function(
src.1,
src.2,
do.inverse=FALSE,
...
){
if(!do.inverse){
out <- merge(x=src.1, y=src.2, ...)
} else {
if("by.y" %in% names(list(...))){
src.2.0 <- src.2
src.2 <- src.1
src.1 <- src.2.0
}
src.1p <- do.call("paste", src.1)
src.2p <- do.call("paste", src.2)
out <- src.1[! src.1p %in% src.2p, ]
}
return(out)
}
)
Method 1: Using sqldf() Our query will be sqldf('SELECT * FROM df1 EXCEPT SELECT * FROM df2'). It will exclude all the rows from df1 that are also present in df2 and will return only rows that are only present in df1.
Pandas DataFrame is a Two-dimensional data structure of mutable size and heterogeneous tabular data. There are different Built-in data types available in Python.
frame() function in R Language is used to return TRUE if the specified data type is a data frame else return FALSE.
Set value of display. max_rows to None and pass it to set_option and this will display all rows from the data frame.
Here are a few ways. #1 and #4 assume that the rows of x.1
are unique. (If rows of x.1
are not unique then they will return only one of the duplicates among the duplicated rows.) The others return all duplicates:
# 1
x.1[!duplicated(rbind(x.2, x.1))[-(1:nrow(x.2))],]
# 2
do.call("rbind", setdiff(split(x.1, rownames(x.1)), split(x.2, rownames(x.2))))
# 3
x.1p <- do.call("paste", x.1)
x.2p <- do.call("paste", x.2)
x.1[! x.1p %in% x.2p, ]
# 4
library(sqldf)
sqldf("select * from `x.1` except select * from `x.2`")
EDIT: x.1 and x.2 were swapped and this has been fixed. Also have corrected note on limitations at the beginning.
What about using merge
- the simplest possible solution - I'd think it's also the fastest.
tmp = merge(x.1, cbind(x.2, myid = 1:nrow(x.2)), all.x = TRUE)
# provided that there's no column myid in both dataframes
tmp[is.na(tmp$myid), 1:ncol(x.1)] # the result
Corresponds to:
select x1.*
from x1 natural left join x2
where x2.myid is NULL
(you can also use sqldf to do that).
Note that the column myid
is added to assure that there is some column w/o NA values. If you are sure there is already some column which doesn't contain NULL values, you can simplify the solution:
tmp = merge(x.1, x.2, all.x = TRUE)
# provided that there's no column myid in both dataframes
tmp[is.na(tmp$some_column), 1:ncol(x.1)] # the result
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With