Filter dataframe rows by index name

I have a DataFrame where I want to drop elements depending on their index name

               col1  col2
entry_1          10    11
entry_2_test     12    13
entry_3          14    15
entry_4_test     16    17

Basically I want to drop the ones ending with _test

I know how to select them:

df.filter(like='_test', axis=0)

               col1  col2
entry_2_test     12    13
entry_4_test     16    17

Then I can actually get those indexes:

df.filter(like='_test', axis=0).index

entry_2_test
entry_4_test

And finally I can drop those indexes and overwrite my dataframe with the filtered one.

df = df.drop(df.filter(like='_test', axis=0).index)
df

               col1  col2
entry_1          10    11
entry_3          14    15

My question is whether this is the correct way of filtering, or whether there's a more efficient, dedicated function to do this?
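For reference, here's a minimal, self-contained version of the example above (the DataFrame construction is assumed, since it isn't shown in the question):

```python
import pandas as pd

# Reconstruct the example DataFrame shown above
df = pd.DataFrame(
    {"col1": [10, 12, 14, 16], "col2": [11, 13, 15, 17]},
    index=["entry_1", "entry_2_test", "entry_3", "entry_4_test"],
)

# Select the '_test' rows, then drop them by index label
df = df.drop(df.filter(like="_test", axis=0).index)
print(df)
```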

asked Aug 07 '18 by Sembei Norimaki



3 Answers

You can invert the result of str.endswith using ~:

In[13]:
df.loc[~df.index.str.endswith('_test')]

Out[13]: 
         col1  col2
entry_1    10    11
entry_3    14    15

Alternatively, slice the last 5 characters (the length of '_test') and compare using !=:

In[18]:
df.loc[df.index.str[-5:]!='_test']

Out[18]: 
         col1  col2
entry_1    10    11
entry_3    14    15

It's also possible to use filter by passing a regex pattern, keeping only the rows that don't end with '_test':

In[25]:
df.filter(regex='.*[^_test]$', axis=0)

Out[25]: 
         col1  col2
entry_1    10    11
entry_3    14    15

As pointed out by @user3483203, [^_test] is a character class (it rejects any label whose last character is '_', 't', 'e', or 's'), so a negative lookbehind is more robust:

df.filter(regex='.*(?<!_test)$', axis=0)
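A quick check with the standard re module shows the difference (DataFrame.filter matches labels with re.search; the label 'entry_5_best' is a made-up example to expose the bug):

```python
import re

labels = ["entry_1", "entry_2_test", "entry_5_best"]

# Character-class pattern: wrongly rejects any label whose last
# character is '_', 't', 'e', or 's' ('entry_5_best' ends in 't')
kept_class = [s for s in labels if re.search(r'.*[^_test]$', s)]

# Negative lookbehind: only rejects labels actually ending in '_test'
kept_lookbehind = [s for s in labels if re.search(r'.*(?<!_test)$', s)]

print(kept_class)       # ['entry_1']
print(kept_lookbehind)  # ['entry_1', 'entry_5_best']
```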
answered Oct 07 '22 by EdChum

With filter and a regex (the character-class caveat noted in the previous answer applies here too):

df.filter(regex='.*[^_test]$', axis=0)
Out[274]: 
         col1  col2
entry_1    10    11
entry_3    14    15
answered Oct 07 '22 by BENY


You can use a list comprehension and feed a list of Boolean values to pd.DataFrame.loc.

While this may seem like an anti-pattern, it's actually more efficient, as Pandas string methods are not particularly optimised:

df2 = pd.concat([df]*10000)

%timeit df2.loc[[i[-5:] != '_test' for i in df2.index]]        # 11.7 ms per loop
%timeit df2.loc[[not i.endswith('_test') for i in df2.index]]  # 13.3 ms per loop
%timeit df2[~(df2.index.str[-5:] == '_test')]                  # 22.1 ms per loop
%timeit df2[~df2.index.str.endswith('_test')]                  # 21.7 ms per loop
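Applied to the original frame, the list-comprehension approach looks like this (a sketch; the construction of df is assumed from the question):

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [10, 12, 14, 16], "col2": [11, 13, 15, 17]},
    index=["entry_1", "entry_2_test", "entry_3", "entry_4_test"],
)

# Build a Boolean mask in plain Python, then index with .loc
mask = [not label.endswith('_test') for label in df.index]
print(df.loc[mask])
```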
answered Oct 07 '22 by jpp