Filter pandas row where 1st letter in a column is/is-not a certain value

How do I filter a pandas DataFrame to exclude the rows where the first letter of a column value is 'Z' (or any other given character)?

I have the following pandas DataFrame, df, which has more than 25,000 rows:

   TIME_STAMP               Activity    Action  Quantity  EPIC  Price   Sub-activity  Venue
0  2017-08-30 08:00:05.000  Allocation  BUY     50        RRS   77.6    CPTY          066
1  2017-08-30 08:00:05.000  Allocation  BUY     50        RRS   77.6    CPTY          066
3  2017-08-30 08:00:09.000  Allocation  BUY     91        BATS  47.875  CPTY          PXINLN
4  2017-08-30 08:00:10.000  Allocation  BUY     43        PNN   8.07    CPTY          WCAPD
5  2017-08-30 08:00:10.000  Allocation  BUY     270       SGE   6.93    CPTY          PROBDMAD

I am trying to remove all the rows where the 1st letter of the Venue is 'Z'.

For example, my usual filter code would be something like this (filtering out all rows where Venue is '066'):

df = df[df.Venue != '066']

The list comprehension below picks out the values I need as a plain list, but I am not sure how to express it as a DataFrame filter:

[k for k in df.Venue if 'Z' not in k]
Kiann asked Oct 01 '18 09:10


2 Answers

Use str[0] to select the first character, or use str.startswith, or str.contains with the regex anchor ^ for the start of the string. To invert a boolean mask, use ~:

df1 = df[df.Venue.str[0] != 'Z']

df1 = df[~df.Venue.str.startswith('Z')]

df1 = df[~df.Venue.str.contains('^Z')]

If the column contains no NaN values, a list comprehension is faster:

df1 = df[[x[0] != 'Z' for x in df.Venue]]

df1 = df[[not x.startswith('Z') for x in df.Venue]]
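
The vectorised string methods above propagate missing values, so inverting their result with ~ can fail or misbehave when Venue contains NaN. Here is a minimal sketch of one way around that, using the na argument of str.startswith. The sample frame and the decision to keep rows with a missing Venue are assumptions for illustration, not part of the question:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one Venue starts with 'Z' and one is missing.
df = pd.DataFrame({
    "EPIC": ["RRS", "BATS", "PNN", "SGE"],
    "Venue": ["066", "ZPXINLN", np.nan, "WCAPD"],
})

# na=False treats a missing Venue as "does not start with Z",
# so the mask is strictly boolean and ~ inverts it safely.
starts_with_z = df["Venue"].str.startswith("Z", na=False)
df1 = df[~starts_with_z]

print(df1)
```

Pass na=True instead if rows with a missing Venue should be dropped as well.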
jezrael answered Nov 10 '22 15:11


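For the case where you do not have NaN values, you can convert the NumPy representation of a series to type '<U1' and test equality. '<U1' is a one-character Unicode string type, so the conversion truncates every value to its first character (the column here is named 'A'):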

df1 = df[df['A'].values.astype('<U1') != 'Z']
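
To see why this works, here is a small sketch on made-up data (the sample values are assumptions). It uses to_numpy(dtype=object) in place of .values so that the cast always starts from a plain object array; that substitution is mine, not the answer's:

```python
import pandas as pd

# Hypothetical sample data with no missing values.
df = pd.DataFrame({"A": ["ZEBRA", "066", "PXINLN", "Zulu"]})

# Casting to '<U1' keeps only the first character of each string.
first_chars = df["A"].to_numpy(dtype=object).astype("<U1")
print(first_chars)

# Element-wise comparison yields a boolean NumPy array usable as a row mask.
mask = first_chars != "Z"
df1 = df[mask]
print(df1)
```

The comparison is case-sensitive and only looks at the truncated character, so it is equivalent to the str[0] test as long as every value is a non-empty string.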

Performance benchmarking

from random import choice
from string import ascii_uppercase

import pandas as pd

# 100,000 random 10-letter uppercase strings
L = [''.join(choice(ascii_uppercase) for _ in range(10)) for _ in range(100000)]
df = pd.DataFrame({'A': L})

%timeit df['A'].values.astype('<U1') != 'Z'       # 4.05 ms per loop
%timeit [x[0] != 'Z' for x in df['A']]            # 11.9 ms per loop
%timeit [not x.startswith('Z') for x in df['A']]  # 23.7 ms per loop
%timeit ~df['A'].str.startswith('Z')              # 53.6 ms per loop
%timeit df['A'].str[0] != 'Z'                     # 53.7 ms per loop
%timeit ~df['A'].str.contains('^Z')               # 127 ms per loop
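
The %timeit lines above need IPython. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The smaller data size, the fixed seed, the labels and the to_numpy(dtype=object) call are my assumptions, and timings will differ by machine and pandas version:

```python
import timeit
from random import choice, seed
from string import ascii_uppercase

import numpy as np
import pandas as pd

seed(0)  # fixed seed so the generated data is reproducible
values = ["".join(choice(ascii_uppercase) for _ in range(10)) for _ in range(10_000)]
df = pd.DataFrame({"A": values})

# Candidate ways of building the "first letter is not Z" mask.
candidates = {
    "astype('<U1')": lambda: df["A"].to_numpy(dtype=object).astype("<U1") != "Z",
    "list comp x[0]": lambda: [x[0] != "Z" for x in df["A"]],
    "list comp startswith": lambda: [not x.startswith("Z") for x in df["A"]],
    "str.startswith": lambda: ~df["A"].str.startswith("Z"),
    "str[0]": lambda: df["A"].str[0] != "Z",
    "str.contains": lambda: ~df["A"].str.contains("^Z"),
}

# All candidates should produce the same mask before timings are compared.
reference = np.asarray(candidates["str[0]"](), dtype=bool)
for name, func in candidates.items():
    assert np.array_equal(np.asarray(func(), dtype=bool), reference), name

# Average wall-clock time over 10 calls of each candidate.
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=10) / 10
    print(f"{name:22s} {seconds * 1e3:8.2f} ms per loop")
```

Checking that all masks agree first guards against comparing the speed of methods that do not compute the same thing.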
jpp answered Nov 10 '22 16:11