I've observed today that selecting two or more columns of Data frame may be much slower than selecting only one. If I use loc, or iloc to choose more than one column and I use list to pass column names or indexes, then performance drops 100 times in comparison to single column or many column selection with iloc (but no list passed) examples: <pre class="prettyprint"><code>df = pd.DataFrame(np.random.randn(10**7,10), columns=list('abcdefghij')) </code></pre> One column selection: <pre class="prettyprint"><code>%%timeit -n 100 df['b'] 3.17 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) %%timeit -n 100 df.iloc[:,1] 66.7 µs ± 5.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) %%timeit -n 100 df.loc[:,'b'] 44.2 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) </code></pre> Two columns selection: <pre class="prettyprint"><code>%%timeit -n 10 df[['b', 'c']] 96.4 ms ± 788 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) %%timeit -n 10 df.loc[:,['b', 'c']] 99.4 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) %%timeit -n 10 df.iloc[:,[1,2]] 97.6 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) </code></pre> Only this selection works like expected: [EDIT] <pre class="prettyprint"><code>%%timeit -n 100 df.iloc[:,1:3] 103 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) </code></pre> What are the differences in mechanisms and why they are so big? [EDIT]: As @run-out pointed out, pd.Series seems to be processed much faster than pd.DataFrame, anyone knows why it is the case? On the other hand - it does not explain difference between <code>df.iloc[:,[1,2]]</code> and <code>df.iloc[:,1:3]</code>

I found this probably rooted in numpy. numpy has two kind of indexing: <ol> <li>basic indexing like a[1:3]</li> <li>advanced indexing like a[[1,2]]</li> </ol> according to the documentation, <blockquote> Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view). </blockquote> So if you check <pre class="prettyprint"><code>a=df.values %timeit -n2 a[:,0:3] %timeit -n2 a[:,[0,1,2]] </code></pre> you got <pre class="prettyprint"><code>The slowest run took 5.06 times longer than the fastest. This could mean that an intermediate result is being cached. 1.57 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 2 loops each) 188 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 2 loops each) </code></pre> quite similar behavior to pandas dataframe

pandas performance: columns selection

Tags:

pandas

I've observed today that selecting two or more columns of Data frame may be much slower than selecting only one.

If I use loc, or iloc to choose more than one column and I use list to pass column names or indexes, then performance drops 100 times in comparison to single column or many column selection with iloc (but no list passed)

examples:

df = pd.DataFrame(np.random.randn(10**7,10), columns=list('abcdefghij'))

One column selection:

%%timeit -n 100
df['b']
3.17 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit -n 100
df.iloc[:,1]
66.7 µs ± 5.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit -n 100
df.loc[:,'b']
44.2 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Two columns selection:

%%timeit -n 10
df[['b', 'c']]
96.4 ms ± 788 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
df.loc[:,['b', 'c']]
99.4 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
df.iloc[:,[1,2]]
97.6 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Only this selection works like expected: [EDIT]

%%timeit -n 100
df.iloc[:,1:3]
103 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

What are the differences in mechanisms and why they are so big?

[EDIT]: As @run-out pointed out, pd.Series seems to be processed much faster than pd.DataFrame, anyone knows why it is the case?

On the other hand - it does not explain difference between df.iloc[:,[1,2]] and df.iloc[:,1:3]

372

asked Feb 19 '19 13:02

Daniel R

2 Answers

Pandas works with single rows or columns as a pandas.Series, which would be faster than working within the DataFrame architecture.

Pandas works with pandas.Series when you ask for:

%%timeit -n 10
df['b']
2.31 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

However, I can call a DataFrame for the same column by putting it in a list. Then you get:

%%timeit -n 10
df[['b']]
90.7 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

You can see from the above that it's the Series that is outperforming the DataFrame.

Here is how Pandas is working with column 'b'.

type(df['b'])
pandas.core.series.Series

type(df[['b']])
pandas.core.frame.DataFrame

EDIT: I'm expanding on my answer as OP wants to dig deeper into why there is greater speed is so much greater for pd.series vs. pd.dataframe. And also as this is a great question to expand my/our understanding of how the underlying technology works. Those with more expertise please chime in.

First let's start with numpy as it's a building block of pandas. According to Wes McKinney, author of pandas and from Python for Data Analysis, the performance pick up in numpy over python:

This is based partly on performance differences having to do with the
cache hierarchy of the CPU; operations accessing contiguous blocks of memory (e.g.,
summing the rows of a C order array) will generally be the fastest because the mem‐
ory subsystem will buffer the appropriate blocks of memory into the ultrafast L1 or
L2 CPU cache.

Let's see the speed difference for this example. Let's make a numpy array from column 'b' of the dataframe.

a = np.array(df['b'])

And now do the performance test:

%%timeit -n 10
a

The results are:

32.5 ns ± 28.2 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)

That's a serious pick up in performance over the pd.series time of 2.31 µs.

The other main reason for performance pickup is that numpy indexing goes straight into NumPy C extensions, but there is a lot of python stuff going on when you index into a Series, and this is a lot slower. (read this article)

Let's look at the question of why does:

df.iloc[:,1:3]

drastically outperform:

df.iloc[:,[1,2]]

It's interesting to note that .loc has the same effect with performance as .iloc in this scenario.

Our first big clue that something is not right is in the following code:

df.iloc[:,1:3] is df.iloc[:,[1,2]]
False

These give the same result, but are different objects. I've done a deep dive try to find out what the difference is. I was unable to find reference to this on the internet or in my library of books.

Looking at the source code, we can start to see some difference. I refer to indexing.py.

In the Class _iLocIndexer we can find some extra work being done by pandas for list in an iloc slice.

Right away, we run into these two difference when checking input:

if isinstance(key, slice):
            return

vs.

elif is_list_like_indexer(key):
            # check that the key does not exceed the maximum size of the index
            arr = np.array(key)
            l = len(self.obj._get_axis(axis))

            if len(arr) and (arr.max() >= l or arr.min() < -l):
                raise IndexError("positional indexers are out-of-bounds")

Could this alone be cause enough for the reduced performance? I don't know.

Although .loc is slightly different, it also suffers performance when using a list of values. Looking in index.py, look at def _getitem_axis(self, key, axis=None): --> in class _LocIndexer(_LocationIndexer):

The code section for is_list_like_indexer(key) that handles list inputs is quite long including a lot of overhead. It contains the note:

# convert various list-like indexers
# to a list of keys
# we will use the *values* of the object
# and NOT the index if its a PandasObject

Certainly there is enough additional overhead in dealing with a list of values or integers then direct slices to cause delays in processing.

The rest of the code is past my pay grade. If anyone can have a look and chime it, it would be most welcome

150

answered Sep 21 '22 02:09

run-out

I found this probably rooted in numpy.

numpy has two kind of indexing:

basic indexing like a[1:3]
advanced indexing like a[[1,2]]

according to the documentation,

Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).

So if you check

a=df.values
%timeit -n2 a[:,0:3]
%timeit -n2 a[:,[0,1,2]]

you got

The slowest run took 5.06 times longer than the fastest. This could mean that an intermediate result is being cached.
1.57 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 2 loops each)
188 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

quite similar behavior to pandas dataframe

answered Sep 22 '22 02:09

user15964

Related questions
                            
                                How to change the linestyle of whiskers in pandas boxplots?
                            
                                How to get max value and name from a Pandas series?
                            
                                Pandas groupby() on one column and then sum on another
                            
                                Pandas explode list of dictionaries into rows
                            
                                Python read csv with Hebrew header
                            
                                Slice a Dataframe Index per year - Python - Pandas
                            
                                python pandas: split comma-separated column into new columns - one per value
                            
                                datetime to decimal hour and minutes in python3
                            
                                Tensorflow Error: "Label IDs must < n_classes", but my Label IDs appear to meet this requirement already
                            
                                how to 'fuzzy' match strings when merge two dataframe in pandas
                            
                                Format pandas y axis to show time instead of total seconds
                            
                                Pandas: select rows if keyword appears in any column
                            
                                AttributeError: 'ExceptionInfo' object has no attribute 'traceback' when using pytest to assert exceptions
                            
                                How to convert int64 to datetime in pandas
                            
                                Subset a data frame based on index value [duplicate]
                            
                                How to compute Shannon entropy of Information from a Pandas Dataframe?
                            
                                Making a clustered bar chart, Pandas
                            
                                Why is changing values in a column of a pandas data frame fast in one case and slow in another one?
                            
                                Append list to pandas DataFrame as new row with index
                            
                                LabelEncoder that keeps missing values as 'NaN'

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With