
Efficient searching over multiple columns

I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame([
    {
        'name': 'Alexandra',
        'surname': 'Johnson',
        'age': 5
    },
    {
        'name': 'Roland',
        'surname': 'Hawking',
        'age': 3
    }
])

To find rows matching a certain condition, I use a combination of boolean conditions:

df[((df['name']=='Roland') | (df['surname']=='Johnson')) & (df['age'] > 4)]

returning

        name  surname  age
0  Alexandra  Johnson    5

However, this does not scale well to several hundred thousand rows. To speed up searching on a single column, I set that column as the index:

indexed = df.set_index('name')
indexed[indexed.index == 'Roland']

For multiple columns I can use a MultiIndex, but I don't know how to combine boolean conditions across multiple index levels. Can you advise?

dzieciou asked Jun 03 '26 13:06



1 Answer

I'm not sure if this is what you are referring to, but I'll take a stab at it:

Set name and surname as the index:

df = df.set_index(['name','surname'])

Filter for Alexandra and Johnson by passing the values in as a tuple:

df.loc[('Alexandra', 'Johnson')]

    age    5
Name: (Alexandra, Johnson), dtype: int64
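If you only know the value for one level, the same index can still be used. Here is a minimal sketch, assuming the sample frame from the question; the variable names indexed, age and by_surname are just for illustration. Sorting the index first is generally recommended for MultiIndex lookups.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'Alexandra', 'surname': 'Johnson', 'age': 5},
    {'name': 'Roland', 'surname': 'Hawking', 'age': 3},
])

# Sort the MultiIndex so that label-based lookups can work efficiently.
indexed = df.set_index(['name', 'surname']).sort_index()

# Full key plus a column label returns the single matching value.
age = indexed.loc[('Alexandra', 'Johnson'), 'age']

# Cross-section on one level only: every row whose surname is 'Johnson'.
by_surname = indexed.xs('Johnson', level='surname')
print(age)
print(by_surname)
```

Note that xs drops the level you selected on by default, so by_surname is indexed by name only.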

You will get much faster lookups if you use iloc, but that requires you to know the integer positions of the rows. But yeah, with a MultiIndex you pass a tuple of index values, one per level: if the index has three columns, the tuple will have three elements.
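For the combination of boolean conditions from the question, one possible approach is to build the masks from the index levels with get_level_values, or to use DataFrame.query, which can refer to index level names as if they were columns. This is only a sketch on the sample data, and the names indexed, mask, result and result_query are made up for the example. Both variants still evaluate the condition on every row, so I would not assume they are faster than filtering on plain columns without measuring.

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'Alexandra', 'surname': 'Johnson', 'age': 5},
    {'name': 'Roland', 'surname': 'Hawking', 'age': 3},
])
indexed = df.set_index(['name', 'surname']).sort_index()

# Pull each index level out as an array-like and compare it like a column.
names = indexed.index.get_level_values('name')
surnames = indexed.index.get_level_values('surname')

# Combine the index-level masks with a mask on the remaining column.
mask = ((names == 'Roland') | (surnames == 'Johnson')) & (indexed['age'] > 4).to_numpy()
result = indexed[mask]

# Same filter as a query string; 'name' and 'surname' resolve to index levels.
result_query = indexed.query("(name == 'Roland' or surname == 'Johnson') and age > 4")
print(result)
```

Both result and result_query should contain the single row for Alexandra Johnson, matching the output in the question.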

sammywemmy answered Jun 06 '26 04:06



