I have a data where I need to filter out any rows that do start with a certain values - emphasis on plural:
Below the data exactly as it appears in file data.xlsx
Name Remains
GESDSRPPZ0161 TRUE
RT6000996 TRUE
RT6000994 TRUE
RT6000467 TRUE
RT6000431 TRUE
MCOPSR0034 FALSE
MCOPSR0033 FALSE
I need to be able to return a dataframe where name DOES NOT start with MCO, GE,etc.
import pandas as pd
import numpy as np
### data
file = r'C:\Users\user\Desktop\data.xlsx'
data = pd.read_excel(file, na_values = '')
data['name'] = data['name'].str.upper()
prefixes = ['IM%','JE%','GE%','GV%','CHE%','MCO%']
new_data = data.select(lambda x: x not in prefixes)
new_data.shape
the last call returns exactly the same dataset as I started with.
I tried:
pandas select from Dataframe using startswith
but it excludes data if the string is elsewhere (not only starts with)
df = df[df['Column Name'].isin(['Value']) == False]
The above answer would work if I knew exactly the string in question, however it changes (the common part is MCOxxxxx, GVxxxxxx, GExxxxx...)
The vvery same happens with this one:
How to implement 'in' and 'not in' for Pandas dataframe
because the values I have to pass have to be exact. Is there any way to do with using the same logic as here (Is there any equivalent for wildcard characters like SQL?):
How do I select rows where a column value starts with a certain string?
Thanks for help! Can we expand please on the below?
@jezrael although I've chosen the other solution for simplicity (and my lack of understanding of your solution), but I'd like to ask for a bit of explanation please. What does '^' + '|^' do in this code and how is it different from Wen's solution? How does it compare performance wise when you have for loop construct as oppose to operation on series like map or apply? If I understand correctly contains() is not bothered with the location whereby startswith() specifically looks at the beggining of the string. Does it mean the ^indicates to contains() to do what? Start at the beginning? And | is it another special char for the method or is it treated like logical OR? I'd really want to learn this if you don't mind sharing. Thanks
You can using startswith
, the ~
in the front will convert from in to not in
prefixes = ['IM','JE','GE','GV','CHE','MCO']
df[~df.Name.str.startswith(tuple(prefixes))]
Out[424]:
Name Remains
1 RT6000996 True
2 RT6000994 True
3 RT6000467 True
4 RT6000431 True
Use str.contains
with ^
for start of string and filter by boolean indexing
:
prefixes = ['IM','JE','GE','GV','CHE','MCO']
pat = '|'.join([r'^{}'.format(x) for x in prefixes])
df = df[~df['Name'].str.contains(pat)]
print (df)
Name Remains
1 RT6000996 True
2 RT6000994 True
3 RT6000467 True
4 RT6000431 True
Thanks, @Zero for another solution:
df = df[~df['Name'].str.contains('^' + '|^'.join(prefixes))]
print (df)
Name Remains
1 RT6000996 True
2 RT6000994 True
3 RT6000467 True
4 RT6000431 True
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With