I am asking this question for my own learning. As far as I know, the following are the different ways to remove columns from a pandas DataFrame.
Option - 1:
df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
del df['a']
Option - 2:
df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
df=df.drop('a',axis=1)
Option - 3:
df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
df=df[['b','c']]
We can use Pandas drop() function to drop multiple columns from a dataframe. Pandas drop() is versatile and it can be used to drop rows of a dataframe as well. To use Pandas drop() function to drop columns, we provide the multiple columns that need to be dropped as a list.
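As a minimal sketch of the point above (the column names and values are made up for illustration), passing a list of labels drops several columns in one call, and the same method drops rows when the axis is switched:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Pass a list of labels to drop several columns at once.
# This returns a new DataFrame and leaves df unchanged.
trimmed = df.drop(['a', 'c'], axis=1)

# The same method drops rows when axis=0.
# The labels here are values of the default integer index.
fewer_rows = df.drop([0, 2], axis=0)

print(trimmed)
print(fewer_rows)
```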
According to the docs:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
And pandas.DataFrame.drop:
Drop specified labels from rows or columns.
So, I think we should stick with df.drop. Why? I think the pros are:
It gives us more control over the removal:
# This will return a NEW DataFrame object, leave the original `df` untouched.
df.drop('a', axis=1)
# This will modify `df` in place and return `None`.
df.drop('a', axis=1, inplace=True)
It can handle more complicated cases with its arguments. E.g. with level, we can handle MultiIndex deletion, and with errors, we can control whether a missing label raises an exception.
It's a more unified and object-oriented way.
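To illustrate the errors and level arguments mentioned above, here is a small sketch. The frames and the label 'missing' are hypothetical and not from the original answer:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# errors='ignore' skips labels that do not exist instead of raising KeyError.
safe = df.drop(['a', 'missing'], axis=1, errors='ignore')

# With MultiIndex columns, level selects which level the label is matched against.
cols = pd.MultiIndex.from_tuples([('x', 'one'), ('x', 'two'), ('y', 'one')])
wide = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=cols)

# Drop every column whose second-level label is 'one'.
no_one = wide.drop('one', axis=1, level=1)

print(safe)
print(no_one)
```

With the default errors='raise', the first call would raise a KeyError because 'missing' is not a column.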
And just like @jezrael noted in his answer:
Option 1: Using the keyword del is a limited way.
Option 3: And df=df[['b','c']] isn't even a deletion in essence. It first selects data by indexing with the [] syntax, then unbinds the name df from the original DataFrame and binds it to the new one (i.e. df[['b','c']]).
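The difference between rebinding a name and deleting from the object can be seen with a second reference to the same DataFrame. This is only a sketch; the name alias is introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
alias = df  # a second name bound to the same object

# Selection builds a new DataFrame and rebinds the name df to it.
df = df[['b', 'c']]

# The original object, still reachable through alias, keeps column 'a'.
alias_cols_after_select = list(alias.columns)

# del removes the column from the object itself.
del alias['a']

print(alias_cols_after_select)
print(list(alias.columns))
```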
The recommended way to delete a column or row in pandas dataframes is using drop.
To delete a column,
df.drop('column_name', axis=1, inplace=True)
To delete a row,
df.drop('row_index', axis=0, inplace=True)
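A runnable version of the two calls above, as a sketch with made-up row labels. It also shows the columns= and index= keywords, which I believe are an equivalent spelling of the axis form in pandas 0.21 and later:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['r1', 'r2', 'r3'])

# With inplace=True the call returns None and df itself is modified.
result = df.drop('a', axis=1, inplace=True)
df.drop('r2', axis=0, inplace=True)

# Keyword form: name the axis through the argument instead of axis=.
other = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['r1', 'r2', 'r3'])
other = other.drop(columns='a').drop(index='r2')

print(df)
print(other)
```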
You can refer to this post for a detailed discussion of column deletion approaches.
From a speed perspective, option 1 seems to be the best. Obviously, based on the other answers, that doesn't mean it's actually the best option.
In [52]: import timeit
In [53]: s1 = """
...: import pandas as pd
...: df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
...: del df['a']
...: """
In [54]: s2 = """
...: import pandas as pd
...: df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
...: df=df.drop('a',axis=1)
...: """
In [55]: s3 = """
...: import pandas as pd
...: df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
...: df=df[['b','c']]
...: """
In [56]: timeit.timeit(stmt=s1, number=100000)
Out[56]: 53.37321400642395
In [57]: timeit.timeit(stmt=s2, number=100000)
Out[57]: 79.68139410018921
In [58]: timeit.timeit(stmt=s3, number=100000)
Out[58]: 76.25269913673401
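Note that each timed statement above also includes the import and the DataFrame construction. A possible way to narrow the comparison, sketched here with hypothetical helper names, is to build the frame once and let each option work on a copy, so the shared overhead is the same for all three. The numbers depend on the machine and the pandas version, so none are shown:

```python
import timeit
import pandas as pd

base = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                     'b': [6, 7, 8, 9, 10],
                     'c': [11, 12, 13, 14, 15]})

def with_del():
    df = base.copy()  # copy so that base keeps column 'a' for the next run
    del df['a']
    return df

def with_drop():
    return base.copy().drop('a', axis=1)

def with_select():
    return base.copy()[['b', 'c']]

# timeit.timeit accepts a callable; each function runs 1000 times.
timings = {f.__name__: timeit.timeit(f, number=1000)
           for f in (with_del, with_drop, with_select)}

for name, seconds in timings.items():
    print(name, seconds)
```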
In my opinion, options 2 and 3 are the best, because the first has limits: you can remove only one column at a time, and you cannot use dot notation (del df.a).
Option 3 is not deleting but selecting, and piRSquared wrote a nice answer with multiple possible solutions based on the same idea.