I have a very large data frame with 100 million rows and categorical columns. I would like to know if there is a faster way of selecting rows by category than using the .isin() method or the .join() method mentioned here.
Given that the data is already categorised, I expected selecting by category to be very fast, but the few tests I ran performed disappointingly. The only other solution I found was from here, but it does not seem to work with pandas 0.20.2.
Here is an example data set.
import pandas as pd
import random
import string
df = pd.DataFrame({'categories': [random.choice(string.ascii_letters)
                                  for _ in range(1000000)]*100,
                   'values': [random.choice([0,1])
                              for _ in range(1000000)]*100})
df['categories'] = df['categories'].astype('category')
Testing with .isin():
%timeit df[df['categories'].isin(list(string.ascii_lowercase))]
44 s ± 894 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using .join():
%timeit df.set_index('categories').join(
    pd.Series(index=list(string.ascii_lowercase), name='temp'),
    how='inner').rename_axis('categories').reset_index().drop('temp', axis=1)
24.7 s ± 1.69 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here's a similar but different approach that directly compares the value rather than using isin.
Basic map / lambda comparison:
%timeit df[df['categories'].map(lambda x: x in string.ascii_lowercase)]
> 1 loop, best of 3: 12.3 s per loop
Using isin:
%timeit df[df['categories'].isin(list(string.ascii_lowercase))]
> 1 loop, best of 3: 55.1 s per loop
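Before trusting either timing, it may be worth confirming that the two masks select the same rows. The following is a minimal sketch on a scaled-down copy of the sample data; the names `small`, `mask_map`, `mask_isin` and `same` are illustrative only, and the `.astype(bool)` is a defensive assumption in case `map` returns a non-boolean dtype on a categorical column in some pandas versions.

```python
import random
import string

import pandas as pd

# Scaled-down stand-in for the 100-million-row frame (10,000 rows).
random.seed(0)
small = pd.DataFrame({
    "categories": [random.choice(string.ascii_letters) for _ in range(10_000)],
    "values": [random.choice([0, 1]) for _ in range(10_000)],
})
small["categories"] = small["categories"].astype("category")

# Mask built with map / lambda, cast to bool defensively.
mask_map = small["categories"].map(
    lambda x: x in string.ascii_lowercase
).astype(bool)

# Mask built with isin, as in the question.
mask_isin = small["categories"].isin(list(string.ascii_lowercase))

# True when both approaches flag exactly the same rows.
same = bool((mask_map == mask_isin).all())
```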
Versions: Py 3.5.1 / IPython 5.1.0 / Pandas 0.20.3
Background: I noticed in one of the SO posts you linked to that a commenter mentioned that isin needs to create a set() during execution, so skipping that step and doing a plain membership test against the string seems to be where the speedup comes from.
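A related idea, offered only as a sketch and not timed at the 100-million-row scale, is to make the membership decision once per category and then spread it to the rows through the integer codes that a categorical column stores. The names `df_small`, `wanted`, `cat_is_wanted` and `selected` are hypothetical, and the frame is a small stand-in for the one in the question.

```python
import string

import numpy as np
import pandas as pd

# Small stand-in for the frame in the question (10,000 rows).
rng = np.random.default_rng(0)
letters = list(string.ascii_letters)
df_small = pd.DataFrame({
    "categories": pd.Categorical(rng.choice(letters, size=10_000),
                                 categories=letters),
    "values": rng.integers(0, 2, size=10_000),
})

wanted = list(string.ascii_lowercase)

# One membership check per category (52 here), not one per row.
cat_is_wanted = df_small["categories"].cat.categories.isin(wanted)

# Each row holds an integer code pointing into the categories;
# a code of -1 marks a missing value, so exclude those explicitly.
codes = df_small["categories"].cat.codes.to_numpy()
mask = (codes >= 0) & cat_is_wanted[codes]

selected = df_small[mask]
```

Whether this beats the map / lambda version on the full data set would need to be measured with %timeit on the same machine.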
Disclaimer: this is not the kind of scale I deal with regularly, so there may be even faster options out there.
Edit: Extra detail on request in comments from JohnGalt:
df.shape
> (100000000, 2)
df.dtypes
> categories category
values int64
dtype: object
To create the sample data, I copied and pasted the sample DF from the initial question. Run on an early-2015 MBP.