
Function to find symmetric difference (opposite of intersection) in R?

The Problem

I have two string vectors of different lengths. Each vector has a different set of strings. I want to find the strings that are in one vector but not in both; that is, the symmetric difference.

Analysis

I looked at the function setdiff, but its output depends on the order in which the vectors are considered. I found the custom function outersect, but this function requires the two vectors to be of the same length.
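For illustration, a minimal sketch of that asymmetry with two made-up vectors of different lengths (the names `x` and `y` and their contents are hypothetical, not the actual data):

```r
# Two illustrative character vectors of different lengths (made-up data)
x <- c("apple", "banana", "cherry")
y <- c("banana", "cherry", "date", "elderberry")

# setdiff() is asymmetric: it keeps the elements of its first argument
# that are absent from its second argument
only_in_x <- setdiff(x, y)
only_in_y <- setdiff(y, x)

print(only_in_x)  # "apple"
print(only_in_y)  # "date" "elderberry"
```

Neither call alone gives the symmetric difference; it is the combination of both results.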

Any suggestions?

Correction

This issue seems to be specific to the data with which I am working. Otherwise, the answer below addresses the problem I mention in this post. I will look to see what is unique about my data and post back if I learn anything that might be helpful to other users.

Gyan Veda asked Nov 05 '13 20:11



3 Answers

Why not:

sym_diff <- function(a,b) setdiff(union(a,b), intersect(a,b))
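A minimal usage sketch, assuming two made-up vectors of different lengths (`x` and `y` are illustrative names, not taken from the question):

```r
# Symmetric difference: everything in the union that is not in the intersection
sym_diff <- function(a, b) setdiff(union(a, b), intersect(a, b))

# Illustrative inputs of different lengths (made-up data)
x <- c("apple", "banana", "cherry")
y <- c("banana", "cherry", "date", "elderberry")

# Swapping the arguments does not change which elements are returned,
# only the order in which they appear
res_xy <- sym_diff(x, y)
res_yx <- sym_diff(y, x)

print(res_xy)                    # "apple" "date" "elderberry"
print(res_yx)                    # "date" "elderberry" "apple"
print(setequal(res_xy, res_yx))  # TRUE
```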
Blue Magister answered Sep 20 '22 17:09


Another option that is a bit faster is:

sym_diff2 <- function(a,b) unique(c(setdiff(a,b), setdiff(b,a)))

If we compare it with the answer by Blue Magister:

sym_diff <- function(a,b) setdiff(union(a,b), intersect(a,b))

library(microbenchmark)
library(MASS)

set.seed(1)
cars1 <- sample(Cars93$Make, 70)
cars2 <- sample(Cars93$Make, 70)

microbenchmark(sym_diff(cars1, cars2), sym_diff2(cars1, cars2), times = 10000L)

> Unit: microseconds
>                     expr     min       lq     mean   median      uq      max neval
>   sym_diff(cars1, cars2) 114.719 119.7785 150.7510 125.0410 131.177 12382.02 10000
>  sym_diff2(cars1, cars2)  94.369 100.0205 121.6051 103.8285 109.239 12013.69 10000

identical(sym_diff(cars1, cars2), sym_diff2(cars1, cars2))
>[1] TRUE

The speed difference between these two methods increases when the vectors being compared are larger (thousands of elements or more), but I couldn't find an example dataset with that many values.
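One way to try larger inputs without a real dataset is to generate synthetic strings. The sketch below is only an illustration under stated assumptions: the pool size, the sample sizes, and the `id_` labels are made up, and it uses base R's `system.time()` instead of `microbenchmark` so that it runs without extra packages. Timings vary by machine, so none are shown here.

```r
sym_diff  <- function(a, b) setdiff(union(a, b), intersect(a, b))
sym_diff2 <- function(a, b) unique(c(setdiff(a, b), setdiff(b, a)))

set.seed(1)
# Synthetic pool of 200,000 distinct labels (made-up data)
pool <- paste0("id_", seq_len(200000))
big1 <- sample(pool, 100000)
big2 <- sample(pool, 100000)

# Repeat each call a few times so the elapsed time is measurable
t1 <- system.time(for (i in 1:5) r1 <- sym_diff(big1, big2))
t2 <- system.time(for (i in 1:5) r2 <- sym_diff2(big1, big2))

print(t1["elapsed"])
print(t2["elapsed"])

# Check that both approaches return the same elements in the same order
print(identical(r1, r2))
```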

sebpardo answered Sep 23 '22 17:09


Here is another symmetric difference function, this one written directly from the definition (see, for instance, the Wikipedia page linked to in the question).

sym_diff3 <- function(a, b) union(setdiff(a, b), setdiff(b, a))
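A small sketch of how the two halves of the definition combine, assuming made-up inputs that contain repeated values (`x` and `y` are illustrative names). Because base R's set functions discard duplicated values, each element appears only once in the result:

```r
# Definition-based version: elements only in a, together with elements only in b
sym_diff3 <- function(a, b) union(setdiff(a, b), setdiff(b, a))

# Illustrative inputs with repeated values (made-up data)
x <- c("a", "a", "b", "c")
y <- c("c", "d", "d", "e", "f")

left_only  <- setdiff(x, y)  # in x but not in y
right_only <- setdiff(y, x)  # in y but not in x
result     <- sym_diff3(x, y)

print(left_only)   # "a" "b"
print(right_only)  # "d" "e" "f"
print(result)      # "a" "b" "d" "e" "f"
```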

Adding this function to the benchmark from sebpardo's answer gives approximately the same timings, though slightly slower. Output omitted.

identical(sym_diff(cars1, cars2), sym_diff3(cars1, cars2))
#[1] TRUE

microbenchmark(sym_diff(cars1, cars2),
               sym_diff2(cars1, cars2), 
               sym_diff3(cars1, cars2),
               times = 10000L)
Rui Barradas answered Sep 22 '22 17:09