I have a dataframe:
import pandas as pd
df = pd.DataFrame(np.random.randn(2, 4))
print(df)
0 1 2 3
0 1.489198 1.329603 1.590124 1.123505
1 0.024017 0.581033 2.500397 0.156280
I want to select the columns which for there is at least one row with a value greater than 2
. I tried the following, but it did not work as expected.
df[df.columns[df.iloc[(0,1)]>2]]
In this toy example my expected output would be:
2
1.590124
2.500397
You can select the Rows from Pandas DataFrame based on column values or based on multiple conditions either using DataFrame. loc[] attribute, DataFrame. query() or DataFrame. apply() method to use lambda function.
Use gt
and any
to filter the df:
In [287]:
df.ix[:,df.gt(2).any()]
Out[287]:
2
0 1.590124
1 2.500397
Here we use ix
to select all rows, the first :
and the next arg is a boolean mask of the columns that meet the condition:
In [288]:
df.gt(2)
Out[288]:
0 1 2 3
0 False False False False
1 False False True False
In [289]:
df.gt(2).any()
Out[289]:
0 False
1 False
2 True
3 False
dtype: bool
In your example what you did was select the cell value for the first row and second column, you then tried to use this to mask the columns but this just returned the first column hence why it didn't work:
In [291]:
df.iloc[(0,1)]
Out[291]:
1.3296030000000001
In [293]:
df.columns[df.iloc[(0,1)]>2]
Out[293]:
'0'
Use mask
created with df > 2
with any
and then select columns by ix
:
import pandas as pd
np.random.seed(18)
df = pd.DataFrame(np.random.randn(2, 4))
print(df)
0 1 2 3
0 0.079428 2.190202 -0.134892 0.160518
1 0.442698 0.623391 1.008903 0.394249
print ((df>2).any())
0 False
1 True
2 False
3 False
dtype: bool
print (df.ix[:, (df>2).any()])
1
0 2.190202
1 0.623391
EDIT by comment:
You can check your solution per partes:
It seems it works, but it always select second column (1
, python count from 0
) column if condition True
:
print (df.iloc[(0,1)])
2.19020235741
print (df.iloc[(0,1)] > 2)
True
print (df.columns[df.iloc[(0,1)]>2])
1
print (df[df.columns[df.iloc[(0,1)]>2]])
0 2.190202
1 0.623391
Name: 1, dtype: float64
And first column (0
) column if False
, because boolean True
and False
are casted to 1
and 0
:
np.random.seed(15)
df = pd.DataFrame(np.random.randn(2, 4))
print (df)
0 1 2 3
0 -0.312328 0.339285 -0.155909 -0.501790
1 0.235569 -1.763605 -1.095862 -1.087766
print (df.iloc[(0,1)])
0.339284706046
print (df.iloc[(0,1)] > 2)
False
print (df.columns[df.iloc[(0,1)]>2])
0
print (df[df.columns[df.iloc[(0,1)]>2]])
0 -0.312328
1 0.235569
Name: 0, dtype: float64
If change column names:
np.random.seed(15)
df = pd.DataFrame(np.random.randn(2, 4))
df.columns = ['a','b','c','d']
print (df)
a b c d
0 -0.312328 0.339285 -0.155909 -0.501790
1 0.235569 -1.763605 -1.095862 -1.087766
print (df.iloc[(0,1)] > 2)
False
print (df[df.columns[df.iloc[(0,1)]>2]])
0 -0.312328
1 0.235569
Name: a, dtype: float64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With