
Fastest method of filtering a pandas data frame by category

I have a very large data frame with 100 million rows and categorical columns. I would like to know if there is a faster way of selecting rows by category than using the .isin() method or .join() method mentioned here.

Since the data is already categorical, I expected selecting by category to be very fast, but the few tests I ran performed disappointingly. The only other solution I found was from here, but it does not seem to work with pandas 0.20.2.

Here is an example data set.

import pandas as pd
import random
import string
df = pd.DataFrame({'categories': [random.choice(string.ascii_letters) 
                                  for _ in range(1000000)]*100,
                   'values': [random.choice([0,1]) 
                              for _ in range(1000000)]*100})
df['categories'] = df['categories'].astype('category')

Testing with .isin():

%timeit df[df['categories'].isin(list(string.ascii_lowercase))]
44 s ± 894 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using .join():

%timeit df.set_index('categories').join(
    pd.Series(index=list(string.ascii_lowercase), name='temp'),
    how='inner').rename_axis('categories').reset_index().drop('temp', axis=1)
24.7 s ± 1.69 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Asked by kayoz, Sep 06 '17 11:09


1 Answer

Here's a related approach that tests each value directly rather than using isin.

Basic map / lambda comparison:

%timeit df[df['categories'].map(lambda x: x in string.ascii_lowercase)]
> 1 loop, best of 3: 12.3 s per loop

Using isin:

%timeit df[df['categories'].isin(list(string.ascii_lowercase))]
> 1 loop, best of 3: 55.1 s per loop

Versions: Py 3.5.1 / IPython 5.1.0 / Pandas 0.20.3

Background: In one of the SO posts you linked to, a commenter mentioned that isin needs to create a set() during execution. Skipping that step and doing a plain membership test against the string (x in string.ascii_lowercase) seems to be where the speedup comes from.
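
To check that the map-based filter and isin really select the same rows, here is a minimal sketch on a small stand-in frame. The helper is_lowercase and the counter calls are names made up for this illustration. The counter is there because recent pandas versions document Categorical.map as mapping the categories rather than every row, which may also contribute to the speedup; whether pandas 0.20.3 behaves the same way is an assumption I have not verified.

```python
import string

import pandas as pd

# Small stand-in for the question's frame; 'categories' is a categorical column.
df = pd.DataFrame({
    'categories': pd.Categorical(list(string.ascii_letters) * 20),
    'values': [0, 1] * 520,
})

calls = 0  # hypothetical counter, only used to observe how often the function runs


def is_lowercase(letter):
    """Membership test used by the map-based filter, with a call counter."""
    global calls
    calls += 1
    return letter in string.ascii_lowercase


via_map = df[df['categories'].map(is_lowercase)]
via_isin = df[df['categories'].isin(list(string.ascii_lowercase))]

# Both filters should keep exactly the same rows.
print(via_map.equals(via_isin))
# Compare the number of function calls with the row count and the category count.
print(calls, len(df), df['categories'].cat.categories.size)
```

If the call count is close to the number of categories rather than the number of rows, the function is being applied per category on your pandas version.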

Disclaimer: this is not the scale I deal with regularly, so there may be even faster options out there.
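
One candidate for such a faster option, offered as a sketch and not as a benchmarked result: since the column is already categorical, the membership test can be done once per category and then spread to the rows through the integer codes. The names wanted, keep_category and row_mask are made up for this example, and the frame is a small stand-in for the 100-million-row one, so timings on the real data would still need to be measured.

```python
import string

import numpy as np
import pandas as pd

# Small stand-in for the 100-million-row frame from the question.
rng = np.random.default_rng(0)
letters = np.array(list(string.ascii_letters))
df = pd.DataFrame({
    'categories': pd.Categorical(rng.choice(letters, size=10_000)),
    'values': rng.integers(0, 2, size=10_000),
})

wanted = list(string.ascii_lowercase)

cat = df['categories'].cat
# One boolean per distinct category (at most 52 here), not one per row.
keep_category = cat.categories.isin(wanted)
# Append a False slot so the missing-value code (-1) indexes it and is dropped.
keep_category = np.append(keep_category, False)
# The integer codes map each row to its category; index the small mask with them.
row_mask = keep_category[cat.codes.values]

filtered = df[row_mask]
print(filtered.shape)
```

The design choice is to keep the per-row work down to a single NumPy integer-indexing step, with the comparison against the wanted values done only on the category labels.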

Edit: Extra detail on request in comments from JohnGalt:

df.shape
> (100000000, 2)
df.dtypes
> categories    category
 values           int64
 dtype: object

To create the sample data, I copied and pasted the sample DF from the initial question. Run on an early 2015 MBP.

Answered by Phil Sheard, Sep 30 '22 00:09
