
Why is changing values in a column of a pandas data frame fast in one case and slow in another one?

I have two pieces of code that seem to do the same thing but one is almost a thousand times faster than the other one.

This is the first piece:

t1 = time.time()
df[new_col] = np.where(df[col] < j, val_1, val_2)
t2 = time.time()
ts.append(t2 - t1) 

In ts I have values like:

0.0007321834564208984, 0.0002918243408203125, 0.0002799034118652344

In contrast, this part of the code:

t1 = time.time()
df['new_col'] = np.where((df[col] >= i1) & (df[col] < i2), val, df.new_col)
t2 = time.time()
ts.append(t2 - t1)

populates ts with values like:

0.11008906364440918, 0.09556794166564941, 0.08580684661865234

I cannot figure out what the essential difference is between the first and second assignments.

In both cases df should be the same.

ADDED

It turned out that the essential difference was not where I was looking. In the fast version of the code I had:

df = inp_df.copy()

at the beginning of the class method (where inp_df was the method's input data frame). In the slow version, I was operating directly on the input data frame. After copying the input data frame and operating on the copy, that version became fast as well.
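The effect described above can be reproduced with a small sketch. `fill_ranges` below is a hypothetical stand-in for the class method (the real method, column name, bounds, and values are not shown in the question), but it follows the same pattern of repeated conditional assignments:

```python
import numpy as np
import pandas as pd

def fill_ranges(df, col, bounds, vals):
    # Hypothetical stand-in for the class method: assign `val`
    # wherever df[col] falls in the half-open interval [i1, i2).
    df['new_col'] = np.nan
    for (i1, i2), val in zip(bounds, vals):
        df['new_col'] = np.where((df[col] >= i1) & (df[col] < i2),
                                 val, df['new_col'])
    return df

inp_df = pd.DataFrame({'A': np.random.default_rng(0).random(10**5)})

# Fast variant: copy first, then operate on the copy, so the
# input frame is left untouched and the method works on a
# freshly consolidated frame rather than a possible view.
out = fill_ranges(inp_df.copy(), 'A', [(0.0, 0.5), (0.5, 1.0)], [1, 2])
```

Working on `inp_df.copy()` also guarantees the caller's data frame is never mutated, which avoids the view-versus-copy bookkeeping that can slow down repeated assignments.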

asked Dec 05 '18 by Roman




2 Answers

Assignment is not the bottleneck

Assigning values to Pandas series is cheap, especially if you are assigning via regular objects such as pd.Series, np.ndarray or list.

Broadcasting is even cheaper

Note broadcasting is extremely cheap, i.e. when you are setting scalar values such as val_1 and val_2 in the first example.

Your second example has a series assignment for the case where your condition is not met. This is relatively expensive.

Calculations are relatively expensive

On the other hand, the calculations you perform are relatively expensive.

In the first example, you have one calculation:

df[col] < j

In the second example, you have at least three calculations:

a = df[col] >= i1
b = df[col] < i2
a & b

Therefore, you can and should expect the second version to be more expensive.

Use timeit

It's good practice to use the timeit module for reliable performance timings. The reproducible example below shows a smaller performance differential than what you claim:

import pandas as pd, numpy as np

np.random.seed(0)
df = pd.DataFrame({'A': np.random.random(10**7)})

j = 0.5
i1, i2 = 0.25, 0.75

%timeit np.where(df['A'] < j, 1, 2)                             # 85.5 ms per loop
%timeit np.where((df['A'] >= i1) & (df['A'] < i2), 1, df['A'])  # 161 ms per loop

One calculation is cheaper than 3 calculations:

%timeit df['A'] < j                                             # 14.8 ms per loop
%timeit (df['A'] >= i1) & (df['A'] < i2)                        # 65.6 ms per loop

Broadcasting via scalar values is cheaper than assigning series:

%timeit np.where(df['A'] < j, 1, df['A'])                       # 113 ms per loop
%timeit np.where((df['A'] >= i1) & (df['A'] < i2), 1, 2)        # 146 ms per loop
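If the underlying goal is to assign one value per interval, `np.select` is an alternative worth noting: it evaluates each condition once and picks the first match per element, replacing the chain of `np.where` calls with a single pass. A sketch, with made-up bin edges and values:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'A': np.random.random(10**5)})

# np.select applies the first matching condition per element
# and falls back to `default` where no condition matches.
conditions = [df['A'] < 0.25, df['A'] < 0.75]
choices = [1, 2]
df['new_col'] = np.select(conditions, choices, default=3)
```

Because conditions are checked in order, the second condition only needs `< 0.75` rather than a two-sided range check.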
answered Nov 14 '22 by jpp


In the first case you check only one condition, so it should be faster than checking two conditions. A simple example using IPython:

In [3]: %timeit 1 < 2                                                                                                                                         
20.4 ns ± 0.434 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [4]: %timeit (1 >= 0) & (1 < 2)                                                                                                                            
37 ns ± 1.37 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
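Note that the parentheses matter on a pandas Series as well: `&` binds more tightly than the comparison operators, so the unparenthesized form is grouped as `s >= (0.25 & s) < 0.75` and does not express the intended range check. A minimal sketch:

```python
import pandas as pd

s = pd.Series([0.1, 0.5, 0.9])

# Each comparison must be parenthesized before combining with `&`,
# since `&` has higher precedence than `>=` and `<` in Python.
mask = (s >= 0.25) & (s < 0.75)
```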
answered Nov 15 '22 by Brown Bear