Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Selecting rows based on the value of columns in other rows

For this problem I would be happy with a solution either in R (ideally with dplyr but other methods would also be OK) or pure SQL.

I have data consisting for individuals (ID) and email addresses, and a binary indicator representing whether the email address is the individual's primary email address (1) or not (0)

  • all IDs have one and only one primary email address
  • IDs can have several non-primary email addresses (or none)
  • IDs can have the same email address as both primary and non-primary

For example:

   ID Email Primary
1   1     A       1
2   1     A       0
3   1     B       0
4   2     A       1
5   2     A       0
6   3     C       1
7   4     D       1
8   4     C       0
9   5     E       1
10  5     F       0

(The actual dataset has around half a million rows)

I wish to identify IDs where an email address is non-primary, but is primary for a different ID. That is, I want to select rows where:

  • Primary is 0
  • There exists another row where that ID is Primary but for a different ID

Thus in the data above, I want to select row 5 (because the email address is non-primary, but primary in row 1 for a different ID and row 8 (because it is non-primary, but primary in row 6 for a different ID) and row 2

For R users, here is the toy dataframe above:

structure(list(ID = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5), Email = c("A", "A", "B", "A", "A", "C", "D", "C", "E", "F"), Primary = c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)), class = "data.frame", row.names = c(NA, -10L))
like image 698
Joe King Avatar asked Jan 25 '23 17:01

Joe King


2 Answers

You can select rows where

  • Primary = 0
  • number of ID's for that Email is greater than 1.
  • There is atleast one primary = 1 for that Email

Using dplyr, you can do this as :

library(dplyr)

df %>% 
   group_by(Email) %>% 
   filter(Primary == 0, n_distinct(ID) > 1, any(Primary == 1))

#     ID Email Primary
#  <dbl> <chr>   <dbl>
#1     1 A           0
#2     2 A           0
#3     4 C           0

Since you have big data a data.table solution would be helpful :

library(data.table)

setDT(df)[, .SD[Primary == 0 & uniqueN(ID) > 1 & any(Primary == 1)], Email]
like image 163
Ronak Shah Avatar answered Jan 27 '23 05:01

Ronak Shah


In SQL, you can use exists for this:

select t.*
  from mytable t
where t.primary = 0 
      and exists (
                  select 1 
                  from mytable t1 
                  where t1.email = t.email 
                    and t1.id <> t.id 
                    and t1.primary = 1
                  )
like image 34
GMB Avatar answered Jan 27 '23 05:01

GMB