Function to find symmetric difference (opposite of intersection) in R?

The Problem

I have two string vectors of different lengths, each containing a different set of strings. I want to find the strings that appear in exactly one of the two vectors, not in both; that is, the symmetric difference.

Analysis

I looked at the function setdiff, but its result depends on the order in which the two vectors are passed. I also found the custom function outersect, but it requires the two vectors to be of the same length.
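As a small illustration of that order dependence, here is a sketch using two made-up vectors (the names x and y and their contents are hypothetical, not the data from the question):

```r
# Hypothetical example vectors of different lengths
x <- c("apple", "banana", "cherry")
y <- c("banana", "date")

# setdiff() is one-sided: it keeps the elements of its first argument
# that do not appear in its second argument
only_in_x <- setdiff(x, y)  # "apple" "cherry"
only_in_y <- setdiff(y, x)  # "date"

print(only_in_x)
print(only_in_y)
```

Neither call alone returns the symmetric difference; each gives only one side of it.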

Any suggestions?

Correction

This issue seems to be specific to the data with which I am working. Otherwise, the answer below addresses the problem I mention in this post. I will look to see what is unique about my data and post back if I learn anything that might be helpful to other users.

asked Nov 05 '13 20:11

Gyan Veda

3 Answers

Why not:

sym_diff <- function(a,b) setdiff(union(a,b), intersect(a,b))
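A minimal usage sketch, assuming two hypothetical character vectors of different lengths (the vector contents are invented for illustration):

```r
# Symmetric difference: elements of the union that are not in the intersection
sym_diff <- function(a, b) setdiff(union(a, b), intersect(a, b))

# Hypothetical inputs of different lengths
a <- c("red", "green", "blue", "yellow")
b <- c("green", "purple")

result <- sym_diff(a, b)
print(result)  # "red" "blue" "yellow" "purple"

# Swapping the arguments returns the same set of elements,
# although the order of the elements may differ
swapped <- sym_diff(b, a)
print(setequal(result, swapped))  # TRUE
```

The vectors do not need to be the same length, since union, intersect, and setdiff all work element-wise on sets.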

answered Sep 20 '22 17:09

Blue Magister

Another option that is a bit faster is:

sym_diff2 <- function(a,b) unique(c(setdiff(a,b), setdiff(b,a)))

If we compare it with the answer by Blue Magister:

sym_diff <- function(a,b) setdiff(union(a,b), intersect(a,b))

library(microbenchmark)
library(MASS)  # provides the Cars93 dataset

set.seed(1)
# Two random samples of 70 car makes each
cars1 <- sample(Cars93$Make, 70)
cars2 <- sample(Cars93$Make, 70)

microbenchmark(sym_diff(cars1, cars2), sym_diff2(cars1, cars2), times = 10000L)

>Unit: microseconds
>                    expr     min       lq     mean   median      uq      max neval
>  sym_diff(cars1, cars2) 114.719 119.7785 150.7510 125.0410 131.177 12382.02 10000
> sym_diff2(cars1, cars2)  94.369 100.0205 121.6051 103.8285 109.239 12013.69 10000

identical(sym_diff(cars1, cars2), sym_diff2(cars1, cars2))
>[1] TRUE

The speed difference between the two methods increases as the vectors being compared get larger (thousands of elements or more), but I couldn't find an example dataset with that many values to demonstrate it.
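One way to try larger inputs without an example dataset is to simulate them. The following is only a sketch under assumptions, not a measured result: the pool of strings, the sample sizes, and the repetition count are arbitrary, and base R's system.time() stands in for microbenchmark to keep the example dependency-free. Timings vary by machine, so no output is shown.

```r
sym_diff  <- function(a, b) setdiff(union(a, b), intersect(a, b))
sym_diff2 <- function(a, b) unique(c(setdiff(a, b), setdiff(b, a)))

set.seed(1)
# Hypothetical pool of 20,000 distinct strings
pool <- paste0("id_", seq_len(20000))
big1 <- sample(pool, 10000)
big2 <- sample(pool, 10000)

# Check that both functions still agree on the larger inputs
same_result <- identical(sym_diff(big1, big2), sym_diff2(big1, big2))
print(same_result)

# Rough elapsed time in seconds over 50 repetitions of each function
t1 <- system.time(for (i in 1:50) sym_diff(big1, big2))
t2 <- system.time(for (i in 1:50) sym_diff2(big1, big2))
print(c(sym_diff = t1[["elapsed"]], sym_diff2 = t2[["elapsed"]]))
```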

answered Sep 23 '22 17:09

sebpardo

Here is another symmetric difference function, this one written directly from the definition (see, for instance, the Wikipedia page linked to in the question).

sym_diff3 <- function(a, b) union(setdiff(a, b), setdiff(b, a))

Adding this function to the benchmark in sebpardo's answer gives approximately the same timings, though it is slightly slower. The benchmark output is omitted.

identical(sym_diff(cars1, cars2), sym_diff3(cars1, cars2))
#[1] TRUE

microbenchmark(sym_diff(cars1, cars2),
               sym_diff2(cars1, cars2), 
               sym_diff3(cars1, cars2),
               times = 10000L)
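As a further check that does not depend on the Cars93 data, the sketch below compares the three functions on two hypothetical vectors that contain repeated values and differ in length. The inputs are invented for illustration.

```r
sym_diff  <- function(a, b) setdiff(union(a, b), intersect(a, b))
sym_diff2 <- function(a, b) unique(c(setdiff(a, b), setdiff(b, a)))
sym_diff3 <- function(a, b) union(setdiff(a, b), setdiff(b, a))

# Hypothetical vectors with repeated values and different lengths
a <- c("x", "x", "y", "z")
b <- c("z", "w", "w")

r1 <- sym_diff(a, b)
r2 <- sym_diff2(a, b)
r3 <- sym_diff3(a, b)

print(r1)  # "x" "y" "w"
print(identical(r1, r2) && identical(r1, r3))  # TRUE
```

All three treat their inputs as sets, so repeated values appear only once in the result.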

answered Sep 22 '22 17:09

Rui Barradas

Tags: r, set-difference, intersect, xor, symmetric-difference