So let's say I have a DataFrame in pandas with m rows and n columns. Let's also say that I want to reverse the order of the columns, which can be done with the following code:
df_reversed = df[df.columns[::-1]]
What is the Big O complexity of this operation? I'm assuming this would depend on the number of columns, but would it also depend on the number of rows?
I don't know how Pandas implements this, but I did test it empirically. I ran the following code (in a Jupyter notebook) to test the speed of the operation:
import pandas as pd

def get_dummy_df(n):
    return pd.DataFrame({'a': [1, 2] * n, 'b': [4, 5] * n, 'c': [7, 8] * n})

df = get_dummy_df(100)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(1000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(10000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(100000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(1000000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(10000000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]
The output was:
(200, 3)
1000 loops, best of 3: 419 µs per loop
(2000, 3)
1000 loops, best of 3: 425 µs per loop
(20000, 3)
1000 loops, best of 3: 498 µs per loop
(200000, 3)
100 loops, best of 3: 2.66 ms per loop
(2000000, 3)
10 loops, best of 3: 25.2 ms per loop
(20000000, 3)
1 loop, best of 3: 207 ms per loop
As you can see, in the first 3 cases the fixed overhead of the operation dominates (400-500 µs), but from the 4th case onward the time becomes proportional to the amount of data, increasing by an order of magnitude each time.
So, assuming the cost is also proportional to n (the number of columns), it seems that we are dealing with O(m*n).
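To also check the dependence on the number of columns, here is a hypothetical variation of the benchmark above that fixes the row count and scales the column count (the get_wide_df helper and its column labels are made up for illustration, not from the original test):

import numpy as np
import pandas as pd

def get_wide_df(rows, cols):
    # hypothetical helper: a rows x cols frame of int64 values
    return pd.DataFrame({f'c{i}': np.arange(rows) for i in range(cols)})

df = get_wide_df(10000, 10)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_wide_df(10000, 100)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_wide_df(10000, 1000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

If the O(m*n) assumption holds, each tenfold increase in the column count should give roughly a tenfold increase in time once the fixed overhead stops dominating.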
The Big O complexity (as of Pandas 0.24) is O(m*n), where m is the number of rows and n is the number of columns. Note that this is when using the DataFrame.__getitem__ method (aka []) with an Index (see the relevant code, along with other key types that would trigger a copy).
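As a quick sanity check that this code path copies the underlying data rather than returning a view, a minimal sketch (the small all-int64 frame and its values are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
df_reversed = df[df.columns[::-1]]

# mutate the original; the reversed frame keeps its own copy of the data
df.loc[0, 'c'] = 99
print(df_reversed.loc[0, 'c'])  # still 5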
Here is a helpful stack trace:
<ipython-input-4-3162cae03863>(2)<module>()
1 columns = df.columns[::-1]
----> 2 df_reversed = df[columns]
pandas/core/frame.py(2682)__getitem__()
2681 # either boolean or fancy integer index
-> 2682 return self._getitem_array(key)
2683 elif isinstance(key, DataFrame):
pandas/core/frame.py(2727)_getitem_array()
2726 indexer = self.loc._convert_to_indexer(key, axis=1)
-> 2727 return self._take(indexer, axis=1)
2728
pandas/core/generic.py(2789)_take()
2788 axis=self._get_block_manager_axis(axis),
-> 2789 verify=True)
2790 result = self._constructor(new_data).__finalize__(self)
pandas/core/internals.py(4539)take()
4538 return self.reindex_indexer(new_axis=new_labels, indexer=indexer,
-> 4539 axis=axis, allow_dups=True)
4540
pandas/core/internals.py(4421)reindex_indexer()
4420 new_blocks = self._slice_take_blocks_ax0(indexer,
-> 4421 fill_tuple=(fill_value,))
4422 else:
pandas/core/internals.py(1254)take_nd()
1253 new_values = algos.take_nd(values, indexer, axis=axis,
-> 1254 allow_fill=False)
1255 else:
> pandas/core/algorithms.py(1658)take_nd()
1657 import ipdb; ipdb.set_trace()
-> 1658 func = _get_take_nd_function(arr.ndim, arr.dtype, out.dtype, axis=axis,
1659 mask_info=mask_info)
1660 func(arr, indexer, out, fill_value)
The func call on line 1660 in pandas/core/algorithms.py ultimately calls a cython function with O(m*n) complexity. This is where the data from the original frame is copied into out, which ends up holding a copy of the original data with the columns in reversed order.
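Conceptually, the work done by that take corresponds to something like the following NumPy-level sketch (a simplification for illustration only, not the actual pandas code; pandas stores a consolidated int64 block roughly as a (columns x rows) array, so reversing the DataFrame's columns is a take along the block's first axis):

import numpy as np

values = np.arange(12, dtype=np.int64).reshape(3, 4)  # one block: 3 columns x 4 rows
indexer = np.array([2, 1, 0])                         # reversed column order
out = np.empty_like(values)

# every element of the block is read once and written once into out,
# which is why the cost scales with rows * columns
for i, idx in enumerate(indexer):
    out[i] = values[idx]  # bulk copy of one block row (one DataFrame column)

The actual per-dtype pandas code is generated from the template below, which does the same thing.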
inner_take_2d_axis0_template = """\
    cdef:
        Py_ssize_t i, j, k, n, idx
        %(c_type_out)s fv

    n = len(indexer)
    k = values.shape[1]

    fv = fill_value

    IF %(can_copy)s:
        cdef:
            %(c_type_out)s *v
            %(c_type_out)s *o

        #GH3130
        if (values.strides[1] == out.strides[1] and
            values.strides[1] == sizeof(%(c_type_out)s) and
            sizeof(%(c_type_out)s) * n >= 256):

            for i from 0 <= i < n:
                idx = indexer[i]
                if idx == -1:
                    for j from 0 <= j < k:
                        out[i, j] = fv
                else:
                    v = &values[idx, 0]
                    o = &out[i, 0]
                    memmove(o, v, <size_t>(sizeof(%(c_type_out)s) * k))
            return

    for i from 0 <= i < n:
        idx = indexer[i]
        if idx == -1:
            for j from 0 <= j < k:
                out[i, j] = fv
        else:
            for j from 0 <= j < k:
                out[i, j] = %(preval)svalues[idx, j]%(postval)s
"""
Note that the above template function has a path that uses memmove (which is the path taken in this case, because we are mapping from int64 to int64 and the output has the same dimensions, since we are just swapping the indexes). Note that memmove is still O(n): it is proportional to the number of bytes it has to copy, although likely faster than writing to each index individually.
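As a minimal sketch of that point (illustrative only, using plain NumPy rather than the pandas internals): a bulk copy still scales with the number of bytes moved, so the fast path changes the constant factor, not the asymptotic complexity.

import numpy as np

small = np.arange(1000000, dtype=np.int64)    # ~8 MB
large = np.arange(10000000, dtype=np.int64)   # ~80 MB

%timeit small.copy()   # one bulk, memmove-style copy
%timeit large.copy()   # ~10x the bytes, so roughly ~10x the time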