Selecting rows based on the value of columns in other rows

Question

For this problem I would be happy with a solution either in R (ideally with dplyr but other methods would also be OK) or pure SQL.

I have data consisting for individuals (ID) and email addresses, and a binary indicator representing whether the email address is the individual's primary email address (1) or not (0)

all IDs have one and only one primary email address
IDs can have several non-primary email addresses (or none)
IDs can have the same email address as both primary and non-primary

For example:

   ID Email Primary
1   1     A       1
2   1     A       0
3   1     B       0
4   2     A       1
5   2     A       0
6   3     C       1
7   4     D       1
8   4     C       0
9   5     E       1
10  5     F       0

(The actual dataset has around half a million rows)

I wish to identify IDs where an email address is non-primary, but is primary for a different ID. That is, I want to select rows where:

Primary is 0
There exists another row where that ID is Primary but for a different ID

Thus in the data above, I want to select row 5 (because the email address is non-primary, but primary in row 1 for a different ID and row 8 (because it is non-primary, but primary in row 6 for a different ID) and row 2

For R users, here is the toy dataframe above:

structure(list(ID = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5), Email = c("A", "A", "B", "A", "A", "C", "D", "C", "E", "F"), Primary = c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)), class = "data.frame", row.names = c(NA, -10L))

Ronak Shah · Accepted Answer

You can select rows where

Primary = 0
number of ID's for that Email is greater than 1.
There is atleast one primary = 1 for that Email

Using dplyr, you can do this as :

library(dplyr)

df %>% 
   group_by(Email) %>% 
   filter(Primary == 0, n_distinct(ID) > 1, any(Primary == 1))

#     ID Email Primary
#  <dbl> <chr>   <dbl>
#1     1 A           0
#2     2 A           0
#3     4 C           0

Since you have big data a data.table solution would be helpful :

library(data.table)

setDT(df)[, .SD[Primary == 0 & uniqueN(ID) > 1 & any(Primary == 1)], Email]

GMB · Answer

In SQL, you can use exists for this:

select t.*
  from mytable t
where t.primary = 0 
      and exists (
                  select 1 
                  from mytable t1 
                  where t1.email = t.email 
                    and t1.id <> t.id 
                    and t1.primary = 1
                  )

Selecting rows based on the value of columns in other rows

Tags:

sql

r

subquery

dplyr

where-clause

Joe King

2 Answers

Ronak Shah

GMB

Recent Activity

Donate For Us

Selecting rows based on the value of columns in other rows

Tags:

sql

r

subquery

dplyr

where-clause

Joe King

2 Answers

Ronak Shah

GMB

Related questions

Recent Activity

Donate For Us