I often need to filter a pandas DataFrame df by df[df['col_name']=='string_value'], and I want to speed up this row selection. Is there a quick way to do that?
For example,
In [1]: df = mul_df(3000,2000,3).reset_index()
In [2]: timeit df[df['STK_ID']=='A0003']
1 loops, best of 3: 1.52 s per loop
Can the 1.52 s be shortened?
Note: mul_df() is a function that creates a multilevel DataFrame:
>>> mul_df(4,2,3)
                 COL000  COL001  COL002
STK_ID RPT_Date
A0000  B000      0.6399  0.0062  1.0022
       B001     -0.2881 -2.0604  1.2481
A0001  B000      0.7070 -0.9539 -0.5268
       B001      0.8860 -0.5367 -2.4492
A0002  B000     -2.4738  0.9529 -0.9789
       B001      0.1392 -1.0931 -0.2077
A0003  B000     -1.1377  0.5455 -0.2290
       B001      1.0083  0.2746 -0.3934
Below is the code of mul_df():
import itertools
import numpy as np
import pandas as pd

def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
    ''' create multilevel dataframe, for example: mul_df(4,2,6)'''
    index_name = ['STK_ID','RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]
    first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
    first_level_dt = list(itertools.chain(*first_level_dt)) # flatten the list
    second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum
    dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype=data_ty)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt
    rst = dt.set_index(index_name, drop=True, inplace=False)
    return rst
I have long wanted to add binary search indexes to DataFrame objects. You can take the DIY approach of sorting by the column and doing this yourself:
In [11]: df = df.sort_values('STK_ID') # df.sort('STK_ID') in pandas < 0.17; skip this if you're sure it's sorted
In [12]: df['STK_ID'].searchsorted('A0003', 'left')
Out[12]: 6000
In [13]: df['STK_ID'].searchsorted('A0003', 'right')
Out[13]: 8000
In [14]: timeit df[6000:8000]
10000 loops, best of 3: 134 µs per loop
This is fast because it always retrieves views and does not copy any data.
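As a rough sketch of the same idea on a recent pandas, the two searchsorted calls and the slice can be wrapped in a small helper. The name select_sorted and the toy frame are made up for illustration, and the helper assumes the frame is already sorted by the key column:

```python
import numpy as np
import pandas as pd

def select_sorted(df, col, value):
    """Return the rows where df[col] == value.

    Hypothetical helper: assumes df is already sorted by `col`,
    so the matching rows form one contiguous block.
    """
    lo = int(df[col].searchsorted(value, side="left"))   # first matching position
    hi = int(df[col].searchsorted(value, side="right"))  # one past the last match
    return df.iloc[lo:hi]                                # positional slice, no boolean mask

# Small stand-in for mul_df(...).reset_index(): 4 keys, 3 rows each.
df = pd.DataFrame({
    "STK_ID": np.repeat(["A0000", "A0001", "A0002", "A0003"], 3),
    "COL000": np.arange(12, dtype="float32"),
})
df = df.sort_values("STK_ID", kind="stable").reset_index(drop=True)

result = select_sorted(df, "STK_ID", "A0003")
print(result)
```

The sort is paid once up front; each lookup afterwards is two binary searches plus a slice, so this is mainly worth it when the same column is queried many times.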
Somewhat surprisingly, working with the .values array instead of the Series is much faster for me:
>>> time df = mul_df(3000, 2000, 3).reset_index()
CPU times: user 5.96 s, sys: 0.81 s, total: 6.78 s
Wall time: 6.78 s
>>> timeit df[df["STK_ID"] == "A0003"]
1 loops, best of 3: 841 ms per loop
>>> timeit df[df["STK_ID"].values == "A0003"]
1 loops, best of 3: 210 ms per loop
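A minimal sketch of the two variants side by side, on a made-up toy frame, showing that they select the same rows. It uses .to_numpy(), the spelling recommended in current pandas in place of .values. How large the speedup is (if any) will depend on the pandas and NumPy versions, so it is worth timing on your own data:

```python
import numpy as np
import pandas as pd

# Small stand-in for mul_df(...).reset_index(): 4 keys, 3 rows each.
df = pd.DataFrame({
    "STK_ID": np.repeat(["A0000", "A0001", "A0002", "A0003"], 3),
    "COL000": np.arange(12, dtype="float32"),
})

# Comparing the Series goes through pandas and yields a boolean Series.
mask_series = df["STK_ID"] == "A0003"

# Comparing the underlying NumPy array yields a plain boolean ndarray.
mask_array = df["STK_ID"].to_numpy() == "A0003"

via_series = df[mask_series]
via_array = df[mask_array]
print(via_array)
```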