I have two pieces of code that seem to do the same thing, but one is almost a thousand times faster than the other.
This is the first piece:
t1 = time.time()
df[new_col] = np.where(df[col] < j, val_1, val_2)
t2 = time.time()
ts.append(t2 - t1)
In ts I have values like:
0.0007321834564208984, 0.0002918243408203125, 0.0002799034118652344
In contrast, this part of the code:
t1 = time.time()
df['new_col'] = np.where((df[col] >= i1) & (df[col] < i2), val, df.new_col)
t2 = time.time()
ts.append(t2 - t1)
creates ts populated with values like:
0.11008906364440918, 0.09556794166564941, 0.08580684661865234
I cannot figure out what the essential difference is between the first and second assignments.
In both cases df should be the same.
ADDED
It turned out that the essential difference was not where I was looking. In the fast version of the code I had:
df = inp_df.copy()
at the beginning of the class method (where inp_df was the method's input data frame). In the slow version, I was operating directly on the input data frame. It became fast after I copied the input data frame and operated on the copy.
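For illustration, here is a minimal sketch of the two patterns; the class name, method names and threshold values are hypothetical, not taken from my actual code:

import numpy as np
import pandas as pd

class Binner:
    # fast variant: work on a private copy of the input frame
    def fast_method(self, inp_df, col):
        df = inp_df.copy()
        df['new_col'] = np.where(df[col] < 0.5, 1, 2)
        return df

    # slow variant: mutate the caller's frame in place
    def slow_method(self, inp_df, col):
        inp_df['new_col'] = np.where(inp_df[col] < 0.5, 1, 2)
        return inp_df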
Assigning values to Pandas series is cheap, especially if you are assigning via regular objects such as pd.Series, np.ndarray or list.
Note broadcasting is extremely cheap, i.e. when you are setting scalar values such as val_1 and val_2 in the first example.
Your second example has a series assignment (df.new_col) for the case where your condition is not met; this is relatively expensive. In addition, the comparisons you perform are themselves relatively expensive.
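To make the broadcasting point concrete, here is a small sketch at the numpy level (the array size and threshold are arbitrary):

import numpy as np

a = np.random.random(10**6)
# scalar branches: 1 and 2 are broadcast, so no extra full-size input array is needed
out_scalar = np.where(a < 0.5, 1, 2)
# array branch: the full array a must be read wherever the condition is False
out_array = np.where(a < 0.5, 1, a)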
In the first example, you have one calculation:
df[col] < j
In the second example, you have at least three calculations:
a = df[col] >= i1
b = df[col] < i2
a & b
Therefore, you can and should expect the second version to be more expensive.
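As an aside that is not part of the original answer: the interval test can also be written as a single call with pd.Series.between (the string form of the inclusive argument requires pandas 1.3+); whether it is actually faster than the two explicit comparisons is worth verifying with timeit:

# same mask as (df[col] >= i1) & (df[col] < i2)
mask = df[col].between(i1, i2, inclusive='left')
df['new_col'] = np.where(mask, val, df['new_col'])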
timeit
It's good practice to use the timeit module for reliable performance timings. The reproducible example below shows a smaller performance differential than what you claim:
import pandas as pd, numpy as np
np.random.seed(0)
df = pd.DataFrame({'A': np.random.random(10**7)})
j = 0.5
i1, i2 = 0.25, 0.75
%timeit np.where(df['A'] < j, 1, 2) # 85.5 ms per loop
%timeit np.where((df['A'] >= i1) & (df['A'] < i2), 1, df['A']) # 161 ms per loop
One calculation is cheaper than three calculations:
%timeit df['A'] < j # 14.8 ms per loop
%timeit (df['A'] >= i1) & (df['A'] < i2) # 65.6 ms per loop
Broadcasting via scalar values is cheaper than assigning series:
%timeit np.where(df['A'] < j, 1, df['A']) # 113 ms per loop
%timeit np.where((df['A'] >= i1) & (df['A'] < i2), 1, 2) # 146 ms per loop
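Outside IPython, the same measurements can be taken with the standard-library timeit module; a minimal sketch (timings will vary by machine):

import timeit

setup = '''
import pandas as pd, numpy as np
np.random.seed(0)
df = pd.DataFrame({'A': np.random.random(10**7)})
j, i1, i2 = 0.5, 0.25, 0.75
'''

# best of 5 single executions, in seconds
print(min(timeit.repeat("np.where(df['A'] < j, 1, 2)",
                        setup=setup, number=1, repeat=5)))
print(min(timeit.repeat("np.where((df['A'] >= i1) & (df['A'] < i2), 1, df['A'])",
                        setup=setup, number=1, repeat=5)))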
In the first case you evaluate only one condition, so it should be faster than checking two conditions. A simple example using ipython:
In [3]: %timeit 1 < 2
20.4 ns ± 0.434 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [4]: %timeit (1 >= 0) & (1 < 2)
37 ns ± 1.37 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)