I've observed today that selecting two or more columns of a DataFrame can be much slower than selecting only one.
If I use loc or iloc to choose more than one column and pass the column names or indexes as a list, performance drops by roughly a factor of 1,000 compared with selecting a single column, or with selecting multiple columns through iloc with a slice (no list passed).
Examples:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10**7, 10), columns=list('abcdefghij'))
Single-column selection:
%%timeit -n 100
df['b']
3.17 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
df.iloc[:,1]
66.7 µs ± 5.95 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
df.loc[:,'b']
44.2 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Two-column selection:
%%timeit -n 10
df[['b', 'c']]
96.4 ms ± 788 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
df.loc[:,['b', 'c']]
99.4 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
df.iloc[:,[1,2]]
97.6 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Only this selection performs as expected:
%%timeit -n 100
df.iloc[:,1:3]
103 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
What are the differences in the underlying mechanisms, and why are they so large?
[EDIT]: As @run-out pointed out, pd.Series seems to be processed much faster than pd.DataFrame. Does anyone know why that is the case?
On the other hand, it does not explain the difference between df.iloc[:,[1,2]] and df.iloc[:,1:3].
Pandas handles a single row or column as a pandas.Series, which is faster to work with than the full DataFrame machinery.
Pandas gives you a pandas.Series when you ask for:
%%timeit -n 10
df['b']
2.31 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, I can get a DataFrame for the same column by putting its name in a list. Then you get:
%%timeit -n 10
df[['b']]
90.7 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can see from the above that the Series outperforms the DataFrame.
Here is what pandas returns for column 'b' in each case:
type(df['b'])
pandas.core.series.Series
type(df[['b']])
pandas.core.frame.DataFrame
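A plausible contributor to the gap, as a sketch to verify on your own version (copy-on-write in newer pandas changes the second result): the single-column Series is typically a view on the frame's existing data, while the one-column DataFrame is built from a copy of it.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10**7, 10), columns=list('abcdefghij'))

s = df['b']      # Series: normally a view onto the frame's underlying block
sub = df[['b']]  # DataFrame: built by copying the column's data

# On classic (pre-copy-on-write) pandas, the Series shares memory with the
# parent frame while the one-column DataFrame does not; under copy-on-write
# the second check may also report True, because the copy is deferred.
print(np.shares_memory(s.to_numpy(), df['b'].to_numpy()))         # True
print(np.shares_memory(sub['b'].to_numpy(), df['b'].to_numpy()))  # False on classic pandas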
EDIT: I'm expanding on my answer, as the OP wants to dig deeper into why the speed-up for pd.Series over pd.DataFrame is so large, and also because this is a great question for expanding my/our understanding of how the underlying technology works. Those with more expertise, please chime in.
First let's start with NumPy, as it's a building block of pandas. According to Wes McKinney, the author of pandas, in Python for Data Analysis, the performance pickup in NumPy over plain Python is explained as follows:
This is based partly on performance differences having to do with the cache hierarchy of the CPU; operations accessing contiguous blocks of memory (e.g., summing the rows of a C order array) will generally be the fastest because the memory subsystem will buffer the appropriate blocks of memory into the ultrafast L1 or L2 CPU cache.
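We can see that cache effect directly with a small experiment of my own (the array shape is arbitrary): summing the rows of a C-order array reads contiguous memory, while the same reduction on a Fortran-order copy has to stride.
import numpy as np

c_order = np.random.randn(1000, 10000)  # rows are contiguous in memory
f_order = np.asfortranarray(c_order)    # same values, columns contiguous instead

# Row sums stream through contiguous memory for the C-order array, so the
# CPU cache is used efficiently; the Fortran-order layout forces strided
# reads and is typically noticeably slower.
%timeit c_order.sum(axis=1)
%timeit f_order.sum(axis=1)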
Now let's see the speed difference for our example by making a NumPy array from column 'b' of the DataFrame.
a = np.array(df['b'])
And now do the performance test:
%%timeit -n 10
a
The results are:
32.5 ns ± 28.2 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
That's a serious pickup in performance over the pd.Series time of 2.31 µs. (Strictly, timing a bare name like this mostly measures Python's variable lookup rather than any array work, so treat the gap as an upper bound.)
The other main reason for the performance pickup is that NumPy indexing goes straight into NumPy's C extensions, whereas there is a lot of Python machinery involved when you index into a Series, and that is much slower. (Read this article.)
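To make that "Python machinery" concrete, here is a small comparison one can run (the numbers will vary by machine; the index position 500 is arbitrary): scalar indexing into the raw ndarray stays in C, while the same lookup on a Series passes through several layers of Python first.
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(10**7))
a = s.to_numpy()

# The ndarray lookup is a single C-level operation; the Series lookup goes
# through indexer resolution, validation and boxing in Python first.
%timeit a[500]
%timeit s.iloc[500]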
Let's look at the question of why does:
df.iloc[:,1:3]
drastically outperform:
df.iloc[:,[1,2]]
It's interesting to note that .loc shows the same performance pattern as .iloc in this scenario.
Our first big clue that something different is going on is in the following code:
df.iloc[:,1:3] is df.iloc[:,[1,2]]
False
These give the same result, but they are different objects. I've done a deep dive trying to find out what the difference is, but I was unable to find a reference to this on the internet or in my library of books.
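One concrete difference can be checked directly, using the df from the question (a sketch; the exact behavior depends on your pandas version): the slice comes back as a view onto the original data, while the list-based selection owns a fresh copy, which for 10**7 rows is real work.
import numpy as np

# The slice shares memory with the parent frame (a view); the list-based
# selection does not (it copied two columns of 10**7 floats).
print(np.shares_memory(df.iloc[:, 1:3].to_numpy(), df.to_numpy()))    # True
print(np.shares_memory(df.iloc[:, [1, 2]].to_numpy(), df.to_numpy())) # False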
Looking at the source code, we can start to see some differences. I refer to pandas/core/indexing.py.
In the class _iLocIndexer we can find some extra work being done by pandas when a list is passed to iloc.
Right away, we run into these two differences when the input is validated:
if isinstance(key, slice):
    return

vs.

elif is_list_like_indexer(key):
    # check that the key does not exceed the maximum size of the index
    arr = np.array(key)
    l = len(self.obj._get_axis(axis))

    if len(arr) and (arr.max() >= l or arr.min() < -l):
        raise IndexError("positional indexers are out-of-bounds")
Could this alone be cause enough for the reduced performance? I don't know.
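One hedged way to probe that question is to time the bounds check on its own, reconstructed from the snippet above; if it comes back in microseconds while the full selection takes ~100 ms, this check alone cannot be the cause.
import numpy as np

key = [1, 2]         # the column positions from the example
arr = np.array(key)
l = len(df.columns)  # the axis length the check validates against

# This mirrors the validation pandas performs for list-like indexers.
%timeit len(arr) and (arr.max() >= l or arr.min() < -l)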
Although .loc is slightly different, it also suffers a performance hit when given a list of values. Looking in indexing.py, see def _getitem_axis(self, key, axis=None): in class _LocIndexer(_LocationIndexer).
The code section for is_list_like_indexer(key) that handles list inputs is quite long and includes a lot of overhead. It contains the note:
# convert various list-like indexers
# to a list of keys
# we will use the *values* of the object
# and NOT the index if its a PandasObject
Certainly there is enough additional overhead in dealing with a list of values or integers, compared with direct slices, to cause delays in processing.
The rest of the code is past my pay grade. If anyone can have a look and chime in, it would be most welcome.
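For anyone who does want to chime in, profiling the slow call is a good starting point; IPython's %prun shows which pandas internals the time is actually spent in (the function names will differ across pandas versions).
# Profile the list-based selection and show the ten most expensive calls.
%prun -l 10 df.iloc[:, [1, 2]]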
I found this is probably rooted in NumPy.
NumPy has two kinds of indexing: basic indexing (e.g. slices) and advanced indexing (e.g. integer lists). According to the documentation,
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
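That claim can be verified directly before timing anything (a small sketch; np.ndarray.base points at the array a view was taken from, and np.shares_memory reports whether two arrays overlap in memory):
import numpy as np

x = np.random.randn(10**6, 10)

basic = x[:, 0:3]           # basic slicing: returns a view
advanced = x[:, [0, 1, 2]]  # advanced (fancy) indexing: returns a copy

print(basic.base is x)                # True: no data was copied
print(np.shares_memory(basic, x))     # True
print(np.shares_memory(advanced, x))  # False: fresh memory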
So if you check
a = df.values
%timeit -n 2 a[:, 0:3]
%timeit -n 2 a[:, [0, 1, 2]]
you get
The slowest run took 5.06 times longer than the fastest. This could mean that an intermediate result is being cached.
1.57 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 2 loops each)
188 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)
quite similar behavior to the pandas DataFrame.