Python Pandas drop columns based on max value of column

I'm just getting going with Pandas as a tool for munging two-dimensional arrays of data. It's super overwhelming, even after reading the docs. You can do so much that I can't figure out how to do anything, if that makes any sense.

My dataframe (simplified):

Date       Stock1  Stock2   Stock3
2014.10.10  74.75  NaN     NaN
2014.9.9    NaN    100.95  NaN 
2010.8.8    NaN    NaN     120.45

So each column only has one value.

I want to remove all columns that have a max value less than x. So say here as an example, if x = 80, then I want a new DataFrame:

Date        Stock2   Stock3
2014.10.10   NaN     NaN
2014.9.9     100.95  NaN 
2010.8.8     NaN     120.45

How can this be achieved? I've looked at dataframe.max(), which gives me a Series. Can I use that, or use a lambda function somehow in select()?

professorDante asked Nov 12 '14 22:11



1 Answer

Use the result of df.max() to build a boolean index over the columns.

In [19]: from pandas import DataFrame

In [20]: import numpy as np

In [23]: df = DataFrame(np.random.randn(3,3), columns=['a','b','c'])

In [36]: df
Out[36]: 
          a         b         c
0 -0.928912  0.220573  1.948065
1 -0.310504  0.847638 -0.541496
2 -0.743000 -1.099226 -1.183567


In [24]: df.max()
Out[24]: 
a   -0.310504
b    0.847638
c    1.948065
dtype: float64

Next, we make a boolean expression out of this:

In [31]: df.max() > 0
Out[31]: 
a    False
b     True
c     True
dtype: bool

Next, you can index df.columns by this (this is called boolean indexing):

In [34]: df.columns[df.max() > 0]
Out[34]: Index([u'b', u'c'], dtype='object')

Finally, pass that to the DataFrame to select only those columns:

In [35]: df[df.columns[df.max() > 0]]
Out[35]: 
          b         c
0  0.220573  1.948065
1  0.847638 -0.541496
2 -1.099226 -1.183567

Of course, instead of 0, use whatever value you want as the cutoff for dropping (80 in the question's example).
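As a rough sketch of the same idea applied to the question's data: this assumes Date is used as the index (the question doesn't say whether it is an index or a column), so that max() only sees the numeric stock columns. It also uses df.loc[:, mask], which selects the same columns as df[df.columns[mask]]. The names x, keep and result are just illustrative.

```python
import numpy as np
import pandas as pd

# Data from the question. Putting Date in the index is an assumption made
# here so that max() is computed over the stock columns only.
df = pd.DataFrame(
    {
        "Stock1": [74.75, np.nan, np.nan],
        "Stock2": [np.nan, 100.95, np.nan],
        "Stock3": [np.nan, np.nan, 120.45],
    },
    index=pd.Index(["2014.10.10", "2014.9.9", "2010.8.8"], name="Date"),
)

x = 80  # cutoff from the question

# max() skips NaN by default, so each column's single value is its max.
# Columns whose max is less than x get False and are dropped.
keep = df.max() >= x

# Boolean-index the columns; .copy() gives an independent DataFrame.
result = df.loc[:, keep].copy()
print(result)
```

Note that a column containing only NaN has a NaN max, and comparing NaN with x gives False, so such a column would be dropped as well.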

Adam Hughes answered Oct 14 '22 03:10