I've noticed that using len on a DataFrame is far quicker than using len on the underlying numpy array, and I don't understand why. Accessing the same information via shape isn't any help either. This matters more as I try to get the number of rows and number of columns, since I was always debating which method to use.

I put together the following experiment, and it's very clear that I will be using len on the DataFrame. But can someone explain why?
from timeit import timeit

import matplotlib.pyplot as plt  # needed for the plots below
import numpy as np
import pandas as pd

ns = np.power(10, np.arange(6))
results = pd.DataFrame(
    columns=ns,
    index=pd.MultiIndex.from_product(
        [['len', 'len(values)', 'shape'], ns]))

dfs = {(n, m): pd.DataFrame(np.zeros((n, m))) for n in ns for m in ns}

for n, m in dfs.keys():
    df = dfs[(n, m)]
    results.loc[('len', n), m] = timeit(
        'len(df)', 'from __main__ import df', number=10000)
    results.loc[('len(values)', n), m] = timeit(
        'len(df.values)', 'from __main__ import df', number=10000)
    results.loc[('shape', n), m] = timeit(
        'df.values.shape', 'from __main__ import df', number=10000)

fig, axes = plt.subplots(2, 3, figsize=(9, 6), sharex=True, sharey=True)
for i, (m, col) in enumerate(results.iteritems()):
    r, c = i // 3, i % 3
    col.unstack(0).plot.bar(ax=axes[r, c], title=m)
From looking at the various methods, the main reason is that constructing the numpy array df.values takes the lion's share of the time.

len(df) and df.shape

These two are fast because they are essentially

len(df.index._data)

and

(len(df.index._data), len(df.columns._data))

where _data is a numpy.ndarray. Thus, df.shape should be about half as fast as len(df), because it finds the lengths of both df.index and df.columns (both of type pd.Index).
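
To make that equivalence concrete, here's a minimal check you can run (assuming a pandas version where Index stores its backing array in the private attribute _data, as above):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((1000, 10)))

# Neither call constructs a new array; both just read lengths off the
# already-materialised index arrays.
assert len(df) == len(df.index._data)
assert df.shape == (len(df.index._data), len(df.columns._data))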
len(df.values) and df.values.shape

Let's say you had already extracted vals = df.values. Then:
In [1]: df = pd.DataFrame(np.random.rand(1000, 10), columns=range(10))
In [2]: vals = df.values
In [3]: %timeit len(vals)
10000000 loops, best of 3: 35.4 ns per loop
In [4]: %timeit vals.shape
10000000 loops, best of 3: 51.7 ns per loop
Compared to:
In [5]: %timeit len(df.values)
100000 loops, best of 3: 3.55 µs per loop
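A quick way to confirm where the time goes is to time the attribute access on its own; it should account for essentially all of the 3.55 µs above (exact numbers will vary by machine and pandas version):

In [6]: %timeit df.values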
So the bottleneck is not len, but how df.values is constructed. If you examine the source of pandas.DataFrame.values, you'll find the (roughly equivalent) methods:
def values(self):
    return self.as_matrix()

def as_matrix(self, columns=None):
    self._consolidate_inplace()
    if self._AXIS_REVERSED:
        return self._data.as_matrix(columns).T
    if len(self._data.blocks) == 0:
        return np.empty(self._data.shape, dtype=float)
    if columns is not None:
        mgr = self._data.reindex_axis(columns, axis=0)
    else:
        mgr = self._data
    if self._data._is_single_block or not self._data.is_mixed_type:
        # fast path: hand back the single block's array
        return mgr.blocks[0].get_values()
    else:
        # slow path (inlined from the BlockManager): allocate a fresh
        # array with a common dtype and copy every block into it
        dtype = _interleaved_dtype(self.blocks)
        result = np.empty(self.shape, dtype=dtype)
        if result.shape[0] == 0:
            return result
        itemmask = np.zeros(self.shape[0])
        for blk in self.blocks:
            rl = blk.mgr_locs
            result[rl.indexer] = blk.get_values(dtype)
            itemmask[rl.indexer] = 1
        # vvv here is your final array, assuming you actually have data
        return result

def _consolidate_inplace(self):
    def f():
        if self._data.is_consolidated():
            return self._data
        bm = self._data.__class__(self._data.blocks, self._data.axes)
        bm._is_consolidated = False
        bm._consolidate_inplace()
        return bm
    self._protect_consolidate(f)

def _protect_consolidate(self, f):
    blocks_before = len(self._data.blocks)
    result = f()
    if len(self._data.blocks) != blocks_before:
        # (inlined from _clear_item_cache) invalidate the cached columns
        self._item_cache.clear()
    return result
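
To sketch what _consolidate_inplace is guarding against, here's a small demonstration (this assumes an older pandas where DataFrame._data exposes the BlockManager, matching the source above):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((1000, 4)))
df['extra'] = np.zeros(1000)        # inserting a column adds a second block

print(df._data.is_consolidated())   # False: two float64 blocks, not merged
_ = df.values                       # as_matrix() consolidates first
print(df._data.is_consolidated())   # True: blocks merged into one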
Note that df._data is a pandas.core.internals.BlockManager, not a numpy.ndarray.
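
The practical upshot of the single-block vs. mixed-type branch above can be seen in this sketch (again, exact behaviour depends on your pandas version):

import numpy as np
import pandas as pd

homogeneous = pd.DataFrame(np.zeros((1000, 4)))   # one float64 block
mixed = pd.DataFrame({'a': np.zeros(1000),        # float64 block
                      'b': np.ones(1000, dtype='int64')})  # int64 block

# Single block: .values can hand back the block's array more or less directly.
print(homogeneous.values.dtype)   # float64
# Mixed dtypes: .values must allocate a fresh array with a common dtype and
# copy every block into it -- the expensive path traced above.
print(mixed.values.dtype)         # float64 (int64 upcast while interleaving)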