Filter pandas row where 1st letter in a column is/is-not a certain value

Kiann · Question

How do I filter a pandas DataFrame so that rows are dropped when the first letter of a column value is 'Z' (or any other given character)?

I have the following pandas DataFrame, df, which has more than 25,000 rows.

TIME_STAMP  Activity    Action  Quantity    EPIC    Price   Sub-activity    Venue
0   2017-08-30 08:00:05.000 Allocation  BUY 50  RRS 77.6    CPTY    066
1   2017-08-30 08:00:05.000 Allocation  BUY 50  RRS 77.6    CPTY    066
3   2017-08-30 08:00:09.000 Allocation  BUY 91  BATS    47.875  CPTY    PXINLN
4   2017-08-30 08:00:10.000 Allocation  BUY 43  PNN 8.07    CPTY    WCAPD
5   2017-08-30 08:00:10.000 Allocation  BUY 270 SGE 6.93    CPTY    PROBDMAD

I am trying to remove all the rows where the 1st letter of the Venue is 'Z'.

For example, my usual filter code would be something like this (filtering out all rows where the Venue is '066'):

df = df[df.Venue != '066']

The list comprehension below gives roughly what I need as a plain list, but I am not sure how to express it as a DataFrame filter.

[k for k in df.Venue if 'Z' not in k]

jezrael · Accepted Answer

Use str[0] to select the first character, or use startswith, or contains with the regex anchor ^ for the start of the string. The boolean mask is inverted with ~:

df1 = df[df.Venue.str[0] != 'Z']

df1 = df[~df.Venue.str.startswith('Z')]

df1 = df[~df.Venue.str.contains('^Z')]
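If Venue may contain NaN, startswith and contains accept an na argument that sets the result for missing values, so the mask stays boolean and can be inverted with ~. The following is a minimal sketch on a small made-up frame (the sample values are illustrative, not the question's data). It also shows one possible way to exclude several first characters at once with isin:

```python
import numpy as np
import pandas as pd

# Illustrative data only: one venue starts with 'Z', one value is missing.
df = pd.DataFrame({"Venue": ["066", "ZXBT", "PXINLN", np.nan, "zeta"]})

# na=False treats missing values as "does not start with 'Z'",
# so the mask is purely boolean and safe to invert.
starts_with_z = df["Venue"].str.startswith("Z", na=False)
kept = df[~starts_with_z]

# Excluding several first characters: take the first character, then isin.
# Missing values are not in the list, so those rows are kept.
first_char_excluded = df["Venue"].str[0].isin(["Z", "0"])
kept_multi = df[~first_char_excluded]

print(kept)
print(kept_multi)
```

Note that the comparison is case-sensitive, so 'zeta' is kept in both results.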

If there are no NaN values, a list comprehension is faster:

df1 = df[[x[0] != 'Z' for x in df.Venue]]

df1 = df[[not x.startswith('Z') for x in df.Venue]]
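The list comprehensions above raise an error on NaN, because a float has neither indexing nor startswith. If missing values cannot be ruled out, one option is to guard the test with isinstance. This is a sketch under the assumption that non-string values should be kept, which may not be what every dataset needs:

```python
import numpy as np
import pandas as pd

# Illustrative data only, including a missing value and an empty string.
df = pd.DataFrame({"Venue": ["066", "ZXBT", "PXINLN", np.nan, ""]})

# Keep a row unless its value is a string that starts with 'Z'.
# startswith is used instead of x[0] so that empty strings do not raise IndexError.
mask = [not (isinstance(x, str) and x.startswith("Z")) for x in df["Venue"]]
df1 = df[mask]

print(df1)
```

The extra isinstance call costs some of the speed advantage, so it is only worth adding when the column is not known to be clean.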

jpp · Answer

For the case where you do not have NaN values, you can convert the NumPy representation of a series to type '<U1' and test equality:

df1 = df[df['A'].values.astype('<U1') != 'Z']
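To see why this works: casting to '<U1' (a fixed-width Unicode string of length 1) truncates every element to its first character, so the comparison only ever looks at that character. A small sketch with made-up values follows. It uses to_numpy(dtype=object) instead of .values, an assumption made here so that the cast starts from a plain NumPy object array whatever string dtype the column has:

```python
import pandas as pd

# Illustrative data only; no missing values, as the approach requires.
df = pd.DataFrame({"A": ["ZXBT", "066", "PXINLN"]})

# Casting to '<U1' keeps only the first character of each string.
first_chars = df["A"].to_numpy(dtype=object).astype("<U1")
print(first_chars)

# The elementwise comparison yields a boolean NumPy array usable as a row mask.
df1 = df[first_chars != "Z"]
print(df1)
```

With a NaN in the column, the cast would turn it into the string 'nan' and keep 'n', which is why this approach is limited to columns without missing values.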

Performance benchmarking

from string import ascii_uppercase
from random import choice
import pandas as pd

# 100,000 random 10-letter uppercase strings
L = [''.join(choice(ascii_uppercase) for _ in range(10)) for _ in range(100000)]
df = pd.DataFrame({'A': L})

%timeit df['A'].values.astype('<U1') != 'Z'       # 4.05 ms per loop
%timeit [x[0] != 'Z' for x in df['A']]            # 11.9 ms per loop
%timeit [not x.startswith('Z') for x in df['A']]  # 23.7 ms per loop
%timeit ~df['A'].str.startswith('Z')              # 53.6 ms per loop
%timeit df['A'].str[0] != 'Z'                     # 53.7 ms per loop
%timeit ~df['A'].str.contains('^Z')               # 127 ms per loop
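The timings only make sense as a comparison if all six expressions produce the same mask. The sketch below checks that on a smaller random sample. The seed, the sample size, and the names masks and all_agree are choices made for this illustration, and to_numpy(dtype=object) stands in for .values as an assumption about the column's dtype:

```python
import random
from string import ascii_uppercase

import numpy as np
import pandas as pd

random.seed(0)  # fixed seed so the sample is reproducible
values = ["".join(random.choice(ascii_uppercase) for _ in range(10)) for _ in range(1000)]
df = pd.DataFrame({"A": values})

# One boolean NumPy array per benchmarked expression.
masks = {
    "astype": df["A"].to_numpy(dtype=object).astype("<U1") != "Z",
    "listcomp_index": np.array([x[0] != "Z" for x in df["A"]]),
    "listcomp_startswith": np.array([not x.startswith("Z") for x in df["A"]]),
    "str_startswith": (~df["A"].str.startswith("Z")).to_numpy(),
    "str_index": (df["A"].str[0] != "Z").to_numpy(),
    "str_contains": (~df["A"].str.contains("^Z")).to_numpy(),
}

reference = masks["str_index"]
all_agree = all(np.array_equal(reference, m) for m in masks.values())
print(all_agree)
```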
