I've been working with pandas DataFrame objects and selecting rows based on column values.
I've noticed that if I build the boolean mask from .values first, the selection is about twice as fast. Why is this?
And is there any reason to use the first example below if it is slower?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, high=10, size=(1000, 4)), columns=['A', 'B', 'C', 'D'])
%timeit df_test = df[df['A'] == 9]
The slowest run took 4.98 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 363 µs per loop
%timeit df_test = df[df['A'].values == 9]
1000 loops, best of 3: 181 µs per loop
A pandas comparison returns the boolean mask as a Series (or a DataFrame). A NumPy comparison returns the boolean mask as a plain array.
Building a Series around the mask takes a little longer than returning a bare array of booleans.
In other words, the extra time in df['A'] == 9 goes into attaching the index to the mask and returning a Series; otherwise the two would take about the same time.
To illustrate:
df['A'] == 9
0 False
1 False
2 False
3 False
4 False
Name: A, dtype: bool
type(df['A'] == 9)
pandas.core.series.Series
df['A'].values == 9
array([False, False, False, False, False], dtype=bool)
type(df['A'].values == 9)
numpy.ndarray
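The comparison above can be reproduced with a small script. This is only a sketch: the fixed seed, the variable names and the use of timeit (instead of %timeit) are my own assumptions, and the timings will differ between machines and pandas versions.

```python
import timeit

import numpy as np
import pandas as pd

# Fixed seed so the data is repeatable (an assumption, not in the original post).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 4)), columns=list("ABCD"))

series_mask = df["A"] == 9        # pandas Series, carries df's index
array_mask = df["A"].values == 9  # plain NumPy array, no index attached

# Both masks hold the same booleans, so here they select the same rows.
same_values = np.array_equal(series_mask.values, array_mask)
same_rows = df[series_mask].equals(df[array_mask])

# Time only the construction of each mask, not the row selection.
t_series = timeit.timeit(lambda: df["A"] == 9, number=1000)
t_array = timeit.timeit(lambda: df["A"].values == 9, number=1000)
print(f"Series mask: {t_series:.4f}s, ndarray mask: {t_array:.4f}s (1000 runs each)")
```

As long as the mask comes from the same DataFrame it is applied to, both forms return identical rows; only the type of the intermediate mask differs.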
So why use Series masking when NumPy masking is so much faster?
Let's say you have a second DataFrame whose index is sorted in a different order:
df = pd.DataFrame(np.random.randint(0, high=10, size=(5, 4)), columns=list('ABCD'))
   A  B  C  D
0  4  9  1  5
1  8  6  5  0
2  5  5  9  5
3  2  5  7  5
4  1  1  7  2
df2 = pd.DataFrame(np.random.randint(0, high=10, size=(5, 4)), columns=list('ABCD'), index=[4, 3, 2, 1, 0])
   A  B  C  D
4  0  4  5  8
3  9  6  7  2
2  0  9  8  6
1  2  6  2  7
0  7  2  8  7
Now suppose you want to select the row of df2 whose index label matches the row of df where column A equals 4.
# With a NumPy mask
df2.loc[df['A'].values == 4]  # Applied by position: the first row is selected, whatever its index label is
   A  B  C  D
4  0  4  5  8
# With a Series mask
df2.loc[df['A'] == 4]  # Aligned on the index: the row labelled 0 is selected
   A  B  C  D
0  7  2  8  7
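Because the frames above were generated randomly, here is a repeatable version of the same idea. It is a minimal sketch with hard-coded values in a single column A (copied from the tables above); the names by_position and by_label are mine.

```python
import pandas as pd

# Column A from the two frames above, hard-coded so the result is repeatable.
df = pd.DataFrame({"A": [4, 8, 5, 2, 1]})                          # index 0..4
df2 = pd.DataFrame({"A": [0, 9, 0, 2, 7]}, index=[4, 3, 2, 1, 0])  # reversed index

# NumPy mask: applied by position, so the first row of df2 (label 4) is picked.
by_position = df2.loc[df["A"].values == 4]

# Series mask: aligned on index labels, so the row labelled 0 is picked.
by_label = df2.loc[df["A"] == 4]

print(by_position.index.tolist())  # [4]
print(by_label.index.tolist())     # [0]
```

The two masks contain exactly the same booleans; the only difference is whether pandas can line them up with df2's index before selecting.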
This is only one example. There are many other cases where you need the index and column labels to be taken into account, and those require Series masking. I hope that explains things a bit better.