I have a DataFrame like this:
import pandas as pd

df = pd.DataFrame([
    {
        'name': 'Alexandra',
        'surname': 'Johnson',
        'age': 5
    },
    {
        'name': 'Roland',
        'surname': 'Hawking',
        'age': 3
    }
])
To find rows matching a certain condition, I use a combination of boolean conditions:
df[((df['name']=='Roland') | (df['surname']=='Johnson')) & (df['age'] > 4)]
which returns
        name  surname  age
0  Alexandra  Johnson    5
However, this does not scale well to several hundred thousand rows. To speed up a search over a single column, I usually set that column as the index:
indexed = df.set_index('name')
indexed[indexed.index == 'Roland']
For multiple columns I can use a MultiIndex, but I don't know how to combine boolean conditions across multiple index levels. Can you advise?
I'm not sure if this is what you are referring to, but I'll take a stab at it.
Set name and surname as the index:
df = df.set_index(['name','surname'])
Filter for Alexandra and Johnson by passing both values in as a tuple:
df.loc[('Alexandra', 'Johnson')]
age 5
Name: (Alexandra, Johnson), dtype: int64
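If only one of the index levels is known, a cross-section lookup with xs() may be what you want. This is a minimal sketch, assuming the same two-row frame as in the question; the variable names indexed, full_match and johnsons are just for illustration:

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'Alexandra', 'surname': 'Johnson', 'age': 5},
    {'name': 'Roland', 'surname': 'Hawking', 'age': 3},
])

# Build the two-level index; sorting it is optional here but is
# generally recommended before doing label-based lookups.
indexed = df.set_index(['name', 'surname']).sort_index()

# Full key: both levels supplied as a tuple.
full_match = indexed.loc[('Alexandra', 'Johnson')]

# Partial key: match on the 'surname' level only.
# drop_level=False keeps both levels in the result's index.
johnsons = indexed.xs('Johnson', level='surname', drop_level=False)
```

Here full_match is a Series for the single matching row, while johnsons is a DataFrame that keeps every row whose surname level equals 'Johnson'.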
You will get much faster lookups if you use iloc, but that requires you to know the integer positions of the rows in advance. With a MultiIndex you pass a tuple of index values, one per level: if the index is built from three columns, the tuple has three elements.
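For the combined condition from the question (name or surname matches, and age is above 4), one option is to build the mask from the index levels with get_level_values(), or to use query(), which can refer to named index levels. This is only a sketch on the same two-row frame; the names indexed, mask, result and same are illustrative, and neither variant is guaranteed to be faster than the original column-based mask, so measure on your own data:

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'Alexandra', 'surname': 'Johnson', 'age': 5},
    {'name': 'Roland', 'surname': 'Hawking', 'age': 3},
])
indexed = df.set_index(['name', 'surname'])

# Pull each index level out as a flat Index so it can be compared
# element-wise, just like a column.
names = indexed.index.get_level_values('name')
surnames = indexed.index.get_level_values('surname')

# Same logic as the original filter, but on index levels plus one column.
mask = ((names == 'Roland') | (surnames == 'Johnson')) & (indexed['age'].to_numpy() > 4)
result = indexed[mask]

# Equivalent filter written with query(); 'name' and 'surname' resolve
# to the index levels, 'age' to the column.
same = indexed.query("(name == 'Roland' or surname == 'Johnson') and age > 4")
```

Both result and same contain only the (Alexandra, Johnson) row, matching the output of the original boolean filter.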