
Using Pandas to Filter String In Cell with Multiple Values

I am using pandas to filter a DataFrame with str.contains(), but my logic is dropping rows that I want to keep. I don't know how to express the right condition in pandas.

Sample cells in the Excel sheet that I am working with look like this:

Case #1: Don't flag this because there is a different recipient, [email protected]

Recipient
---------
[email protected], [email protected], [email protected]

Case #2: Flag this because every recipient contains @work.com

Recipient
---------
[email protected], [email protected], [email protected]

I only need the filter to apply when a specific value occurs. For example, if 'Recipient' contains only the email [email protected], drop the row. But if the Recipient column contains '[email protected], [email protected]' (yes, the values are comma-separated like that in a single cell), keep it. Eventually, the flagged rows will be dropped from a final report. So I want to drop every row that contains only @work.com addresses, but not a row that contains both a @gmail.com and a @work.com address.

The query below drops everything, even if the Recipient column contains 'gmail.com':

df['EMAIL10'] = (
    df['Type'].str.contains('Email')
    & df['Type'].str.contains('Tracking | Data')
    & df['Recipient'].str.contains('@work.com')
)

Let me know if I need to clarify anything.

khabaz asked Dec 07 '22 10:12


1 Answer

You can create a Boolean mask that indicates whether every comma-separated item in a cell contains '@work'.

First, split so that each address becomes a separate list element; explode then turns this into one long Series, with the index duplicated and pointing back to the index of your original DataFrame. .str.contains checks your condition, and groupby(level=0).all() checks whether it's True for every address in a row of your original DataFrame. (Older pandas versions also accepted all(level=0), but that argument was deprecated and is no longer available in pandas 2.0 and later.)

import pandas as pd

df = pd.DataFrame({'col': ['[email protected], [email protected], [email protected]',
                           '[email protected], [email protected], [email protected]']})

# Split each cell, put one address per row, test each address,
# then collapse back to one Boolean per original row.
df['all_work'] = (df['col'].str.split(', ').explode()
                  .str.contains('@work').groupby(level=0).all())

print(df)
                                               col  all_work
0      [email protected], [email protected], [email protected]     False
1  [email protected], [email protected], [email protected]      True
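
Since the addresses above are obscured in this copy, the same idea can be sketched with made-up data. Everything below is an assumption for illustration: the addresses are invented, the column is named `Recipient` as in the question, and `str.endswith('@work.com')` is swapped in for `str.contains('@work')` as a slightly stricter test. It uses `groupby(level=0).all()`, which works on current pandas versions.

```python
import pandas as pd

# Hypothetical addresses, invented for illustration only.
df = pd.DataFrame({
    'Recipient': [
        'ann@work.com, bob@gmail.com, cat@work.com',  # mixed domains
        'ann@work.com, dan@work.com, cat@work.com',   # all @work.com
    ]
})

# One address per row; the index repeats the original row label.
addresses = df['Recipient'].str.split(', ').explode()

# True only when every address in the original row ends with @work.com.
df['all_work'] = addresses.str.endswith('@work.com').groupby(level=0).all()

# Rows to keep in the final report: anything that is not work-only.
kept = df[~df['all_work']]
print(kept)
```

Only the first row survives in `kept`, because it has a non-work recipient.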

For explanation, after split and explode we have:

df['col'].str.split(', ').explode()

 0       [email protected] 
 0      [email protected]   # Each item split separately
 0     [email protected]
 1      [email protected]
 1    [email protected]
 1    [email protected]
#|
#Index corresponds to Index of the original DataFrame
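
To tie this back to the question's `EMAIL10` flag, the per-row mask could be combined with the `Type` conditions roughly as follows. This is only a sketch under assumptions: the `Type` values and addresses are invented, and the pattern is written as `'Tracking|Data'` without spaces, since spaces around `|` in a regular expression are matched literally. Splitting on `','` and stripping whitespace is a small variation that tolerates inconsistent spacing after the commas.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
df = pd.DataFrame({
    'Type': ['Email Tracking', 'Email Tracking', 'Email Data', 'Phone'],
    'Recipient': [
        'ann@work.com, bob@gmail.com',  # mixed: should not be flagged
        'ann@work.com,cat@work.com',    # work-only, no space after comma
        'dan@work.com',                 # single work address
        'ann@work.com',                 # work-only, but Type is not Email
    ],
})

# One address per row, with surrounding whitespace removed.
addresses = df['Recipient'].str.split(',').explode().str.strip()

# True only when every address in the original row ends with @work.com.
all_work = addresses.str.endswith('@work.com').groupby(level=0).all()

is_email = df['Type'].str.contains('Email')
is_tracking_or_data = df['Type'].str.contains('Tracking|Data')

# Flag a row only when all three conditions hold.
df['EMAIL10'] = is_email & is_tracking_or_data & all_work
print(df)
```

The first row is no longer flagged, because one of its recipients is outside `@work.com`.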
ALollz answered Feb 15 '23 19:02