I want to return a dataframe that only shows rows where a User_ID has more than one Email associated with it. In other words, I am trying to count how many distinct User_IDs share an email. See below.
Sample Data
Unnamed: 0 First Name ... User_ID Email
0 0 Bob ... 2011 Bob@email
1 1 Dirk ... 2012 jack@email
2 2 Sarah ... 2013 Sara@email
3 3 max ... 2015 Bob@email
4 4 leo ... 2016 Sara@email
From the table above, my desired outcome would be something like this (note I would drop Value Counts less than 0 as I am only interested in User IDs that have
Output
User_ID (Count of other User_Ids with same Domain)
2011 1
2012 0
2013 1
2015 1
2016 1
In SQL, this would work something like the query below, which returns all user IDs with a count greater than 1 of distinct associated emails. Can someone advise how I can do something similar in Python?
SELECT User_ID, COUNT(EMAILS) AS Count
FROM dataframe
GROUP BY User_ID
HAVING COUNT(EMAILS) > 1
In Python I tried the following, leveraging the value_counts function, but I don't know how to make it produce the desired output above.
import pandas as pd
df = pd.read_csv("data.csv")
#print( df['Email'].value_counts() > 1)
emailList = list(df["Email"].value_counts())
duplicates = df[df['Email'].duplicated(keep=False)]
print(duplicates.value_counts())
Are you after this?
df.groupby('Email')['FirstName'].value_counts()
If you want to keep only the emails with more than one name, please try:
df[df.groupby('Email')['FirstName'].transform('count').gt(1)]
or
df.groupby('Email')['FirstName'].agg(list).to_frame('names')
names
Email
Bob@email [Bob, max]
Sara@email [Sarah, leo]
jack@email [Dirk]
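To get closer to the per-User_ID count shown in the question, one option is to count the distinct User_IDs within each Email group and subtract one for the user itself. This is only a sketch: it rebuilds the sample data inline instead of reading data.csv, and the column name others_with_same_email is made up for illustration.

```python
import pandas as pd

# Sample data mirroring the question (only the two columns needed here).
df = pd.DataFrame({
    "User_ID": [2011, 2012, 2013, 2015, 2016],
    "Email": ["Bob@email", "jack@email", "Sara@email", "Bob@email", "Sara@email"],
})

# For each row, count the distinct User_IDs in its Email group,
# then subtract 1 so the user itself is not counted.
# "others_with_same_email" is a hypothetical column name.
df["others_with_same_email"] = (
    df.groupby("Email")["User_ID"].transform("nunique") - 1
)

# Rough equivalent of SQL's HAVING: keep only users who share an email.
shared = df[df["others_with_same_email"] > 0]
print(shared[["User_ID", "others_with_same_email"]])
```

Using transform keeps the result aligned with the original rows, so the count can be stored as a column and used directly as a filter.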