Filter dataframe rows by index name

I have a DataFrame where I want to drop elements depending on their index name

               col1  col2
entry_1          10    11
entry_2_test     12    13
entry_3          14    15
entry_4_test     16    17

Basically I want to drop the ones ending with _test

I know how to select them:

df.filter(like='_test', axis=0)

               col1  col2
entry_2_test     12    13
entry_4_test     16    17

Then I can actually get those indexes:

df.filter(like='_test', axis=0).index

entry_2_test
entry_4_test

And finally I can drop those indexes and overwrite my dataframe with the filtered one.

df = df.drop(df.filter(like='_test', axis=0).index)
df

               col1  col2
entry_1          10    11
entry_3          14    15

My question is whether this is the correct way of filtering, or whether there's a more efficient, dedicated function to do this?
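For reference, here's a minimal, self-contained version of the example above (the DataFrame construction is assumed, since it isn't shown in the question):

```python
import pandas as pd

# Reconstruct the example DataFrame shown above
df = pd.DataFrame(
    {"col1": [10, 12, 14, 16], "col2": [11, 13, 15, 17]},
    index=["entry_1", "entry_2_test", "entry_3", "entry_4_test"],
)

# Select the '_test' rows, then drop them by index label
df = df.drop(df.filter(like="_test", axis=0).index)
print(df)
```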

asked Aug 07 '18 by Sembei Norimaki



3 Answers

You can invert the result of str.endswith using ~:

In[13]:
df.loc[~df.index.str.endswith('_test')]

Out[13]: 
         col1  col2
entry_1    10    11
entry_3    14    15

Alternatively, slice the last 5 characters (the length of '_test') and compare using !=:

In[18]:
df.loc[df.index.str[-5:]!='_test']

Out[18]: 
         col1  col2
entry_1    10    11
entry_3    14    15

It's also possible to use filter by passing a regex pattern, keeping only the rows that don't end with '_test':

In[25]:
df.filter(regex='.*[^_test]$', axis=0)

Out[25]: 
         col1  col2
entry_1    10    11
entry_3    14    15

As pointed out by @user3483203, [^_test] is a character class (it rejects any label whose last character is '_', 't', 'e', or 's'), so a negative lookbehind is more robust:

df.filter(regex='.*(?<!_test)$', axis=0)
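A quick check with the standard re module shows the difference (DataFrame.filter matches labels with re.search; the label 'entry_5_best' is a made-up example to expose the bug):

```python
import re

labels = ["entry_1", "entry_2_test", "entry_5_best"]

# Character-class pattern: wrongly rejects any label whose last
# character is '_', 't', 'e', or 's' ('entry_5_best' ends in 't')
kept_class = [s for s in labels if re.search(r'.*[^_test]$', s)]

# Negative lookbehind: only rejects labels actually ending in '_test'
kept_lookbehind = [s for s in labels if re.search(r'.*(?<!_test)$', s)]

print(kept_class)       # ['entry_1']
print(kept_lookbehind)  # ['entry_1', 'entry_5_best']
```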
answered Oct 07 '22 by EdChum

With filter and a regex (the character-class caveat noted in the previous answer applies here too):

df.filter(regex='.*[^_test]$', axis=0)
Out[274]: 
         col1  col2
entry_1    10    11
entry_3    14    15
answered Oct 07 '22 by BENY


You can use a list comprehension and feed a list of Boolean values to pd.DataFrame.loc.

While this may seem like an anti-pattern, it's actually more efficient, as Pandas string methods are not particularly optimised:

df2 = pd.concat([df]*10000)

%timeit df2.loc[[i[-5:] != '_test' for i in df2.index]]        # 11.7 ms per loop
%timeit df2.loc[[not i.endswith('_test') for i in df2.index]]  # 13.3 ms per loop
%timeit df2[~(df2.index.str[-5:] == '_test')]                  # 22.1 ms per loop
%timeit df2[~df2.index.str.endswith('_test')]                  # 21.7 ms per loop
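Applied to the original frame, the list-comprehension approach looks like this (a sketch; the construction of df is assumed from the question):

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [10, 12, 14, 16], "col2": [11, 13, 15, 17]},
    index=["entry_1", "entry_2_test", "entry_3", "entry_4_test"],
)

# Build a Boolean mask in plain Python, then index with .loc
mask = [not label.endswith('_test') for label in df.index]
print(df.loc[mask])
```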
answered Oct 07 '22 by jpp