I have a DataFrame where I want to drop rows depending on their index name:
              col1  col2
entry_1         10    11
entry_2_test    12    13
entry_3         14    15
entry_4_test    16    17
Basically, I want to drop the ones ending with _test.
I know how to select them:
df.filter(like='_test', axis=0)
              col1  col2
entry_2_test    12    13
entry_4_test    16    17
Then I can actually get those indexes:
df.filter(like='_test', axis=0).index
Index(['entry_2_test', 'entry_4_test'], dtype='object')
And finally I can drop those indexes and overwrite my dataframe with the filtered one.
df = df.drop(df.filter(like='_test', axis=0).index)
df
         col1  col2
entry_1    10    11
entry_3    14    15
My question is: is this the correct way of filtering, or is there a more efficient, dedicated function for this?
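For reference, the whole round trip described above can be reproduced as a self-contained sketch (the DataFrame is rebuilt from the example values in the question):

```python
import pandas as pd

# Rebuild the example DataFrame from the question
df = pd.DataFrame(
    {"col1": [10, 12, 14, 16], "col2": [11, 13, 15, 17]},
    index=["entry_1", "entry_2_test", "entry_3", "entry_4_test"],
)

# Select the rows to discard, then drop them by their index labels
to_drop = df.filter(like="_test", axis=0).index
df = df.drop(to_drop)
```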
You can invert the result of str.endswith:
In[13]:
df.loc[~df.index.str.endswith('_test')]
Out[13]:
         col1  col2
entry_1    10    11
entry_3    14    15
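As a runnable sketch (DataFrame reconstructed from the question's example):

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [10, 12, 14, 16], "col2": [11, 13, 15, 17]},
    index=["entry_1", "entry_2_test", "entry_3", "entry_4_test"],
)

# Boolean mask: True where the index label does NOT end with '_test'
kept = df.loc[~df.index.str.endswith("_test")]
```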
Alternatively, slice the last 5 characters and compare with !=:
In[13]:
df.loc[df.index.str[-5:]!='_test']
Out[18]:
         col1  col2
entry_1    10    11
entry_3    14    15
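The same slicing approach as a self-contained sketch (example frame rebuilt from the question; note that labels shorter than 5 characters are returned whole by the slice, so they are never mistaken for '_test'):

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [10, 12, 14, 16], "col2": [11, 13, 15, 17]},
    index=["entry_1", "entry_2_test", "entry_3", "entry_4_test"],
)

# Compare the last 5 characters of each label against '_test'
kept = df.loc[df.index.str[-5:] != "_test"]
```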
It's still possible to use filter, by passing a regex pattern that matches only the rows that don't end with '_test':
In[25]:
df.filter(regex='.*[^_test]$', axis=0)
Out[25]:
         col1  col2
entry_1    10    11
entry_3    14    15
As pointed out by @user3483203, [^_test] is a character class (it matches any single character that isn't one of _, t, e or s), so the pattern above only happens to work on this data; a negative lookbehind is more robust:
df.filter(regex='.*(?<!_test)$', axis=0)
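The lookbehind version as a runnable sketch (example frame rebuilt from the question); filter keeps only the labels whose end is not preceded by '_test':

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [10, 12, 14, 16], "col2": [11, 13, 15, 17]},
    index=["entry_1", "entry_2_test", "entry_3", "entry_4_test"],
)

# Keep labels where the end of string is NOT preceded by '_test'
kept = df.filter(regex=r".*(?<!_test)$", axis=0)
```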
With filter and a regex (using the negative-lookbehind form, since [^_test] is a character class rather than a literal suffix):
df.filter(regex='.*(?<!_test)$', axis=0)
Out[274]:
         col1  col2
entry_1    10    11
entry_3    14    15
You can use a list comprehension and feed a list of Boolean values to pd.DataFrame.loc.
While this may seem like an anti-pattern, it's actually more efficient, as pandas string methods are not particularly optimised:
df2 = pd.concat([df]*10000)
%timeit df2.loc[[i[-5:] != '_test' for i in df2.index]]        # 11.7 ms per loop
%timeit df2.loc[[not i.endswith('_test') for i in df2.index]]  # 13.3 ms per loop
%timeit df2[~(df2.index.str[-5:] == '_test')]                  # 22.1 ms per loop
%timeit df2[~df2.index.str.endswith('_test')]                  # 21.7 ms per loop
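A self-contained sketch of the list-comprehension variant, checked against the vectorised string method on a smaller concatenated frame (actual timings will vary by machine, so only the equivalence is shown here):

```python
import pandas as pd

base = pd.DataFrame(
    {"col1": [10, 12, 14, 16], "col2": [11, 13, 15, 17]},
    index=["entry_1", "entry_2_test", "entry_3", "entry_4_test"],
)
df2 = pd.concat([base] * 1000)

# Plain-Python mask built by a list comprehension
by_loop = df2.loc[[not i.endswith("_test") for i in df2.index]]

# Vectorised pandas string method, for comparison
by_str = df2[~df2.index.str.endswith("_test")]
```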