Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas select rows where a value in a columns does not starts with a string

I have a data where I need to filter out any rows that do start with a certain values - emphasis on plural:

Below the data exactly as it appears in file data.xlsx

Name                Remains
GESDSRPPZ0161       TRUE
RT6000996           TRUE
RT6000994           TRUE
RT6000467           TRUE
RT6000431           TRUE
MCOPSR0034          FALSE
MCOPSR0033          FALSE

I need to be able to return a dataframe where name DOES NOT start with MCO, GE,etc.

import pandas as pd
import numpy as np

### data
file = r'C:\Users\user\Desktop\data.xlsx'

data  = pd.read_excel(file, na_values = '')
data['name'] = data['name'].str.upper()

prefixes = ['IM%','JE%','GE%','GV%','CHE%','MCO%']

new_data = data.select(lambda x: x not in prefixes)


new_data.shape

the last call returns exactly the same dataset as I started with.

I tried:

pandas select from Dataframe using startswith

but it excludes data if the string is elsewhere (not only starts with)

df = df[df['Column Name'].isin(['Value']) == False]

The above answer would work if I knew exactly the string in question, however it changes (the common part is MCOxxxxx, GVxxxxxx, GExxxxx...)

The vvery same happens with this one:

How to implement 'in' and 'not in' for Pandas dataframe

because the values I have to pass have to be exact. Is there any way to do with using the same logic as here (Is there any equivalent for wildcard characters like SQL?):

How do I select rows where a column value starts with a certain string?

Thanks for help! Can we expand please on the below?

@jezrael although I've chosen the other solution for simplicity (and my lack of understanding of your solution), but I'd like to ask for a bit of explanation please. What does '^' + '|^' do in this code and how is it different from Wen's solution? How does it compare performance wise when you have for loop construct as oppose to operation on series like map or apply? If I understand correctly contains() is not bothered with the location whereby startswith() specifically looks at the beggining of the string. Does it mean the ^indicates to contains() to do what? Start at the beginning? And | is it another special char for the method or is it treated like logical OR? I'd really want to learn this if you don't mind sharing. Thanks

like image 604
Bartek Malysz Avatar asked Mar 07 '18 14:03

Bartek Malysz


2 Answers

You can using startswith , the ~ in the front will convert from in to not in

prefixes = ['IM','JE','GE','GV','CHE','MCO']

df[~df.Name.str.startswith(tuple(prefixes))]
Out[424]: 
        Name  Remains
1  RT6000996     True
2  RT6000994     True
3  RT6000467     True
4  RT6000431     True
like image 99
BENY Avatar answered Oct 11 '22 07:10

BENY


Use str.contains with ^ for start of string and filter by boolean indexing:

prefixes = ['IM','JE','GE','GV','CHE','MCO']

pat = '|'.join([r'^{}'.format(x) for x in prefixes])
df = df[~df['Name'].str.contains(pat)]
print (df)
        Name  Remains
1  RT6000996     True
2  RT6000994     True
3  RT6000467     True
4  RT6000431     True

Thanks, @Zero for another solution:

df = df[~df['Name'].str.contains('^' + '|^'.join(prefixes))]
print (df)
        Name  Remains
1  RT6000996     True
2  RT6000994     True
3  RT6000467     True
4  RT6000431     True
like image 45
jezrael Avatar answered Oct 11 '22 06:10

jezrael