 

What is the best way to remove columns in pandas

I am asking this question for my own learning. As far as I know, the following are the different methods to remove columns from a pandas DataFrame.

Option - 1:

import pandas as pd
df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
del df['a']

Option - 2:

df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
df=df.drop('a',axis=1)

Option - 3:

df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
df=df[['b','c']]
  1. What is the best approach among these?
  2. Any other approaches to achieve the same?
646 likes
Mohamed Thasin ah asked Jul 04 '18 07:07




4 Answers

Follow the doc:

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

And pandas.DataFrame.drop:

Drop specified labels from rows or columns.

So, I think we should stick with df.drop. Why? I think the pros are:

  1. It gives us more control over the removal:

    # This returns a NEW DataFrame object and leaves the original `df` untouched.
    df.drop('a', axis=1)
    # This modifies `df` in place and returns `None`.
    df.drop('a', axis=1, inplace=True)

  2. It can handle more complicated cases with its arguments. E.g. with level, we can handle MultiIndex deletion, and with errors, we can choose whether a missing label raises a KeyError or is ignored.

  3. It's a more unified and object-oriented way.
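
To make point 2 concrete, here is a minimal sketch of the errors and level arguments. The frames and the names safe, wide and no_a are made up for illustration; they are not from the question.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# errors='ignore' skips labels that do not exist instead of raising KeyError.
safe = df.drop(['a', 'missing'], axis=1, errors='ignore')

# With the default errors='raise', a missing label raises KeyError.
try:
    df.drop('missing', axis=1)
    raised = False
except KeyError:
    raised = True

# level tells drop which level of a MultiIndex the label refers to.
wide = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=pd.MultiIndex.from_tuples(
        [('x', 'a'), ('x', 'b'), ('y', 'a'), ('y', 'b')]
    ),
)
no_a = wide.drop('a', axis=1, level=1)  # removes ('x', 'a') and ('y', 'a')
```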


And just like @jezrael noted in his answer:

Option 1: Using the del keyword is limited.

Option 3: df=df[['b','c']] isn't really a deletion. It first selects data by indexing with the [] syntax, then unbinds the name df from the original DataFrame and binds it to the new one (i.e. df[['b','c']]).
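
A small sketch of that difference, assuming a second name (alias, hypothetical) is kept on the original object so we can see what happens to it:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
alias = df  # second name bound to the same object

# Selection builds a new DataFrame and rebinds the name df to it.
df = df[['b', 'c']]
selection_is_new_object = df is not alias
alias_columns_after_selection = list(alias.columns)  # 'a' is still there

# del mutates the object that alias refers to.
del alias['a']
alias_columns_after_del = list(alias.columns)
```

So after the selection, any other reference to the original frame still sees column 'a'; only del (or drop with inplace=True) changes the existing object.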

186 likes
YaOzI answered Oct 18 '22 23:10


The recommended way to delete a column or row in a pandas DataFrame is to use drop.

To delete a column,

df.drop('column_name', axis=1, inplace=True)

To delete a row,

df.drop('row_index', axis=0, inplace=True)
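
As a quick runnable illustration of the two calls above (the frame and the labels 'r0', 'r1', 'r2' are invented for the example). It also shows the columns= and index= keywords, which as far as I know are accepted by drop in current pandas versions and avoid having to remember axis numbers:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['r0', 'r1', 'r2'])

# With inplace=True the frame itself is modified and the call returns None.
result = df.drop('a', axis=1, inplace=True)
df.drop('r1', axis=0, inplace=True)  # removes the row labelled 'r1'

# Keyword form: same effect, no axis argument, returns a new frame.
other = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r0', 'r1'])
trimmed = other.drop(columns='a', index='r0')
```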

You can refer to this post for a detailed discussion of column deletion approaches.

32 likes
razmik answered Oct 18 '22 23:10


From a speed perspective, option 1 seems to be the best. Obviously, based on the other answers, that doesn't mean it's actually the best option.

In [52]: import timeit

In [53]: s1 = """
    ...: import pandas as pd
    ...: df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
    ...: del df['a']
    ...: """

In [54]: s2 = """
    ...: import pandas as pd
    ...: df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
    ...: df=df.drop('a',axis=1)
    ...: """

In [55]: s3 = """
    ...: import pandas as pd
    ...: df=pd.DataFrame({'a':[1,2,3,4,5],'b':[6,7,8,9,10],'c':[11,12,13,14,15]})
    ...: df=df[['b','c']]
    ...: """

In [56]: timeit.timeit(stmt=s1, number=100000)
Out[56]: 53.37321400642395

In [57]: timeit.timeit(stmt=s2, number=100000)
Out[57]: 79.68139410018921

In [58]: timeit.timeit(stmt=s3, number=100000)
Out[58]: 76.25269913673401
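
Note that the statements above rebuild the DataFrame on every run, so construction time is part of each number. A possible variant, offered only as a sketch and not as the benchmark that produced the figures above, moves the import and construction into timeit's setup and works on a copy each time (del mutates the frame, so it cannot be reused). The names base, statements and timings are my own, and the copy cost is included equally in all three.

```python
import timeit

# Import and DataFrame construction run once, outside the timed statement.
setup = """
import pandas as pd
base = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                     'b': [6, 7, 8, 9, 10],
                     'c': [11, 12, 13, 14, 15]})
"""

# Each statement copies base first so every run starts with column 'a' present.
statements = {
    'del': "df = base.copy(); del df['a']",
    'drop': "df = base.copy(); df = df.drop('a', axis=1)",
    'select': "df = base.copy(); df = df[['b', 'c']]",
}

timings = {
    name: timeit.timeit(stmt=stmt, setup=setup, number=1000)
    for name, stmt in statements.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 1000 runs")
```

The absolute numbers depend on the machine and the pandas version, so treat any ranking as indicative only.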
4 likes
aydow answered Oct 19 '22 00:10


In my opinion the best options are 2 and 3, because the first has limits: you can remove only one column at a time, and you cannot use dot notation (del df.a).

Solution 3 is not deleting but selecting, and piRSquared wrote a nice answer with multiple possible solutions based on the same idea.
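
A short sketch of both limits. The AttributeError for del df.a is what I would expect from the pandas versions I know; treat that detail as an assumption and check it on your own version.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# drop accepts a list, so several columns can go in one call.
only_c = df.drop(['a', 'b'], axis=1)

# del takes one key at a time and only works with bracket syntax.
try:
    del df.a
    dot_del_failed = False
except AttributeError:
    dot_del_failed = True
```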

2 likes
jezrael answered Oct 18 '22 22:10