
Why is changing values in a column of a pandas data frame fast in one case and slow in another one?

I have two pieces of code that seem to do the same thing but one is almost a thousand times faster than the other one.

This is the first piece:

t1 = time.time()
df[new_col] = np.where(df[col] < j, val_1, val_2)
t2 = time.time()
ts.append(t2 - t1) 

In ts I have values like:

0.0007321834564208984, 0.0002918243408203125, 0.0002799034118652344

In contrast, this part of the code:

t1 = time.time()
df['new_col'] = np.where((df[col] >= i1) & (df[col] < i2), val, df.new_col)
t2 = time.time()
ts.append(t2 - t1)

populates ts with values like:

0.11008906364440918, 0.09556794166564941, 0.08580684661865234

I cannot figure out what the essential difference is between the first and second assignments.

In both cases df should be the same.

ADDED

It turned out that the essential difference was not where I was looking. In the fast version of the code I had:

df = inp_df.copy()

at the beginning of the class method (where inp_df was the method's input data frame). In the slow version, I was operating directly on the input data frame. After copying the input data frame and operating on the copy, that version became fast as well.
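The effect described above can be reproduced with a small sketch. `fill_ranges` below is a hypothetical stand-in for the class method (the real method, column name, bounds, and values are not shown in the question), but it follows the same pattern of repeated conditional assignments:

```python
import numpy as np
import pandas as pd

def fill_ranges(df, col, bounds, vals):
    # Hypothetical stand-in for the class method: assign `val`
    # wherever df[col] falls in the half-open interval [i1, i2).
    df['new_col'] = np.nan
    for (i1, i2), val in zip(bounds, vals):
        df['new_col'] = np.where((df[col] >= i1) & (df[col] < i2),
                                 val, df['new_col'])
    return df

inp_df = pd.DataFrame({'A': np.random.default_rng(0).random(10**5)})

# Fast variant: copy first, then operate on the copy, so the
# input frame is left untouched and the method works on a
# freshly consolidated frame rather than a possible view.
out = fill_ranges(inp_df.copy(), 'A', [(0.0, 0.5), (0.5, 1.0)], [1, 2])
```

Working on `inp_df.copy()` also guarantees the caller's data frame is never mutated, which avoids the view-versus-copy bookkeeping that can slow down repeated assignments.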

asked Dec 05 '18 by Roman




2 Answers

Assignment is not the bottleneck

Assigning values to Pandas series is cheap, especially if you are assigning via regular objects such as pd.Series, np.ndarray or list.

Broadcasting is even cheaper

Note broadcasting is extremely cheap, i.e. when you are setting scalar values such as val_1 and val_2 in the first example.

Your second example has a series assignment for the case where your condition is not met. This is relatively expensive.

Calculations are relatively expensive

On the other hand, the calculations you perform are relatively expensive.

In the first example, you have one calculation:

df[col] < j

In the second example, you have at least three calculations:

a = df[col] >= i1
b = df[col] < i2
a & b

Therefore, you can and should expect the second version to be more expensive.

Use timeit

It's good practice to use the timeit module for reliable performance timings. The reproducible example below shows a smaller performance differential than what you claim:

import pandas as pd, numpy as np

np.random.seed(0)
df = pd.DataFrame({'A': np.random.random(10**7)})

j = 0.5
i1, i2 = 0.25, 0.75

%timeit np.where(df['A'] < j, 1, 2)                             # 85.5 ms per loop
%timeit np.where((df['A'] >= i1) & (df['A'] < i2), 1, df['A'])  # 161 ms per loop

One calculation is cheaper than 3 calculations:

%timeit df['A'] < j                                             # 14.8 ms per loop
%timeit (df['A'] >= i1) & (df['A'] < i2)                        # 65.6 ms per loop

Broadcasting via scalar values is cheaper than assigning series:

%timeit np.where(df['A'] < j, 1, df['A'])                       # 113 ms per loop
%timeit np.where((df['A'] >= i1) & (df['A'] < i2), 1, 2)        # 146 ms per loop
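If the underlying goal is to assign one value per interval, `np.select` is an alternative worth noting: it evaluates each condition once and picks the first match per element, replacing the chain of `np.where` calls with a single pass. A sketch, with made-up bin edges and values:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'A': np.random.random(10**5)})

# np.select applies the first matching condition per element
# and falls back to `default` where no condition matches.
conditions = [df['A'] < 0.25, df['A'] < 0.75]
choices = [1, 2]
df['new_col'] = np.select(conditions, choices, default=3)
```

Because conditions are checked in order, the second condition only needs `< 0.75` rather than a two-sided range check.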
answered Nov 14 '22 by jpp


In the first case you check only one condition, so it should be faster than checking two conditions. A simple example using IPython:

In [3]: %timeit 1 < 2                                                                                                                                         
20.4 ns ± 0.434 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [4]: %timeit (1 >= 0) & (1 < 2)                                                                                                                            
37 ns ± 1.37 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
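Note that the parentheses matter on a pandas Series as well: `&` binds more tightly than the comparison operators, so the unparenthesized form is grouped as `s >= (0.25 & s) < 0.75` and does not express the intended range check. A minimal sketch:

```python
import pandas as pd

s = pd.Series([0.1, 0.5, 0.9])

# Each comparison must be parenthesized before combining with `&`,
# since `&` has higher precedence than `>=` and `<` in Python.
mask = (s >= 0.25) & (s < 0.75)
```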
answered Nov 15 '22 by Brown Bear