Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: slow when selecting from dataframe

Tags:

python

pandas

I have a DataFrame that's created from a list of dictionaries, which I obtained from a database. I'm trying to use that as an in-memory database, which I query using the following function:

def filter_entities(df, name1, name2):
    key = ((df.name1 == name1) &
           (df.name2 == name2))

    rows = df.loc[key]
    if len(rows) == 0:
        return None
    return rows.iloc[0]

Doing that seems to be much slower than I expected. Even when testing on a couple of hundred rows, it takes about 1ms per call. I've tried setting an index on those columns when I create the dataframe but it didn't affect the performance:

entities.set_index(['name1', 'name2'], drop=False, inplace=True)

Here's a quick way to create a dataset for testing:

import random, string
import pandas as pd
df = pd.DataFrame([{
   'name1': ''.join([random.choice(string.letters) for i in range(10)]),
   'name2': ''.join([random.choice(string.letters) for i in range(10)]),
   'val1': random.randint(0, 2**16),
   'val2': random.randint(0, 2**16),
   'val3': random.randint(0, 2**16),
   } for j in range(1000)])

In[27]: %timeit filter_entities(df, df['name1'][100], df['name2'][100])
1000 loops, best of 3: 1.91 ms per loop

I'm trying to find an efficient way to make queries on my data. Is there a better way to do this in pandas?

like image 361
noamk Avatar asked Nov 01 '22 10:11

noamk


1 Answers

Setting the columns as the index does improve the performance for me.

With your original filter_entities function:

In [25]: %timeit filter_entities(df, df['name1'][100], df['name2'][100])

1000 loops, best of 3: 1.36 ms per loop

And after setting the columns as the index, and then indexing that frame:

In [26]: df2 = df.set_index(['name1', 'name2'])

In [27]: %timeit df2.loc[df['name1'][100], df['name2'][100]]
10000 loops, best of 3: 160 µs per loop

Note that a large portion of the time spent in the filter_entities function is used for the boolean comparison (creation of the key, and not for the indexing itself).

Second note: if the performance on this scale matters to you, then in many cases it is also useful to think about the larger picture if you need to access individual rows in this way repeatedly, or if you can do it more vectorized.

like image 56
joris Avatar answered Nov 04 '22 08:11

joris