For this problem I would be happy with a solution either in R (ideally with dplyr but other methods would also be OK) or pure SQL.
I have data consisting for individuals (ID
) and email addresses, and a binary indicator representing whether the email address is the individual's primary email address (1) or not (0)
ID
s have one and only one primary email addressID
s can have several non-primary email addresses (or none)ID
s can have the same email address as both primary and non-primaryFor example:
ID Email Primary
1 1 A 1
2 1 A 0
3 1 B 0
4 2 A 1
5 2 A 0
6 3 C 1
7 4 D 1
8 4 C 0
9 5 E 1
10 5 F 0
(The actual dataset has around half a million rows)
I wish to identify IDs where an email address is non-primary, but is primary for a different ID. That is, I want to select rows where:
Thus in the data above, I want to select row 5 (because the email address is non-primary, but primary in row 1 for a different ID
and row 8 (because it is non-primary, but primary in row 6 for a different ID
) and row 2
For R users, here is the toy dataframe above:
structure(list(ID = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5), Email = c("A", "A", "B", "A", "A", "C", "D", "C", "E", "F"), Primary = c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)), class = "data.frame", row.names = c(NA, -10L))
You can select rows where
Primary = 0
ID
's for that Email
is greater than 1.primary = 1
for that Email
Using dplyr
, you can do this as :
library(dplyr)
df %>%
group_by(Email) %>%
filter(Primary == 0, n_distinct(ID) > 1, any(Primary == 1))
# ID Email Primary
# <dbl> <chr> <dbl>
#1 1 A 0
#2 2 A 0
#3 4 C 0
Since you have big data a data.table
solution would be helpful :
library(data.table)
setDT(df)[, .SD[Primary == 0 & uniqueN(ID) > 1 & any(Primary == 1)], Email]
In SQL, you can use exists
for this:
select t.*
from mytable t
where t.primary = 0
and exists (
select 1
from mytable t1
where t1.email = t.email
and t1.id <> t.id
and t1.primary = 1
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With