Given a DataFrame with a column "BoolCol", we want to find the indexes of the DataFrame in which the values for "BoolCol" == True
I currently have the iterating way to do it, which works perfectly:
for i in range(100,3000):
if df.iloc[i]['BoolCol']== True:
print i,df.iloc[i]['BoolCol']
But this is not the correct pandas way to do it. After some research, I am currently using this code:
df[df['BoolCol'] == True].index.tolist()
This one gives me a list of indexes, but they don't match, when I check them by doing:
df.iloc[i]['BoolCol']
The result is actually False!!
Which would be the correct pandas way to do this?
The get_loc() function is used to find the index of any column in the Python pandas dataframe. We simply pass the column name to get_loc() function to find index.
To find the positions of two matching columns, we first initialize a pandas dataframe with two columns of city names. Then we use where() of numpy to compare the values of two columns. This returns an array that represents the indices where the two columns have the same value.
In Python Pandas DataFrame. idmax() method is used to get the index of the first occurrence of maximum over the given axis and this method will always return the index of the maximum value and if the column or row in the Pandas DataFrame is empty then it will raise a Value Error.
df.iloc[i]
returns the ith
row of df
. i
does not refer to the index label, i
is a 0-based index.
In contrast, the attribute index
returns actual index labels, not numeric row-indices:
df.index[df['BoolCol'] == True].tolist()
or equivalently,
df.index[df['BoolCol']].tolist()
You can see the difference quite clearly by playing with a DataFrame with a non-default index that does not equal to the row's numerical position:
df = pd.DataFrame({'BoolCol': [True, False, False, True, True]},
index=[10,20,30,40,50])
In [53]: df
Out[53]:
BoolCol
10 True
20 False
30 False
40 True
50 True
[5 rows x 1 columns]
In [54]: df.index[df['BoolCol']].tolist()
Out[54]: [10, 40, 50]
If you want to use the index,
In [56]: idx = df.index[df['BoolCol']]
In [57]: idx
Out[57]: Int64Index([10, 40, 50], dtype='int64')
then you can select the rows using loc
instead of iloc
:
In [58]: df.loc[idx]
Out[58]:
BoolCol
10 True
40 True
50 True
[3 rows x 1 columns]
Note that loc
can also accept boolean arrays:
In [55]: df.loc[df['BoolCol']]
Out[55]:
BoolCol
10 True
40 True
50 True
[3 rows x 1 columns]
If you have a boolean array, mask
, and need ordinal index values, you can compute them using np.flatnonzero
:
In [110]: np.flatnonzero(df['BoolCol'])
Out[112]: array([0, 3, 4])
Use df.iloc
to select rows by ordinal index:
In [113]: df.iloc[np.flatnonzero(df['BoolCol'])]
Out[113]:
BoolCol
10 True
40 True
50 True
Can be done using numpy where() function:
import pandas as pd
import numpy as np
In [716]: df = pd.DataFrame({"gene_name": ['SLC45A1', 'NECAP2', 'CLIC4', 'ADC', 'AGBL4'] , "BoolCol": [False, True, False, True, True] },
index=list("abcde"))
In [717]: df
Out[717]:
BoolCol gene_name
a False SLC45A1
b True NECAP2
c False CLIC4
d True ADC
e True AGBL4
In [718]: np.where(df["BoolCol"] == True)
Out[718]: (array([1, 3, 4]),)
In [719]: select_indices = list(np.where(df["BoolCol"] == True)[0])
In [720]: df.iloc[select_indices]
Out[720]:
BoolCol gene_name
b True NECAP2
d True ADC
e True AGBL4
Though you don't always need index for a match, but incase if you need:
In [796]: df.iloc[select_indices].index
Out[796]: Index([u'b', u'd', u'e'], dtype='object')
In [797]: df.iloc[select_indices].index.tolist()
Out[797]: ['b', 'd', 'e']
If you want to use your dataframe object only once, use:
df['BoolCol'].loc[lambda x: x==True].index
Simple way is to reset the index of the DataFrame prior to filtering:
df_reset = df.reset_index()
df_reset[df_reset['BoolCol']].index.tolist()
Bit hacky, but it's quick!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With