So let's say I have a DataFrame in pandas with m rows and n columns. Let's also say that I want to reverse the order of the columns, which can be done with the following code:
df_reversed = df[df.columns[::-1]]
What is the Big O complexity of this operation? I'm assuming this would depend on the number of columns, but would it also depend on the number of rows?
I don't know how Pandas implements this, but I did test it empirically. I ran the following code (in a Jupyter notebook) to test the speed of the operation:
import pandas as pd

def get_dummy_df(n):
    return pd.DataFrame({'a': [1, 2] * n, 'b': [4, 5] * n, 'c': [7, 8] * n})

df = get_dummy_df(100)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(1000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(10000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(100000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(1000000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_dummy_df(10000000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]
The output was:
(200, 3)
1000 loops, best of 3: 419 µs per loop
(2000, 3)
1000 loops, best of 3: 425 µs per loop
(20000, 3)
1000 loops, best of 3: 498 µs per loop
(200000, 3)
100 loops, best of 3: 2.66 ms per loop
(2000000, 3)
10 loops, best of 3: 25.2 ms per loop
(20000000, 3)
1 loop, best of 3: 207 ms per loop
As you can see, in the first 3 cases the fixed overhead of the operation dominates (400-500 µs), but from the 4th case onward the time becomes proportional to the amount of data, increasing by an order of magnitude each time.
So, assuming the cost is also proportional to n (the number of columns), it seems that we are dealing with O(m*n).
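To also check the dependence on the number of columns, here is a hypothetical variation of the benchmark above that fixes the row count and scales the column count (the get_wide_df helper and its column labels are made up for illustration, not from the original test):

import numpy as np
import pandas as pd

def get_wide_df(rows, cols):
    # hypothetical helper: a rows x cols frame of int64 values
    return pd.DataFrame({f'c{i}': np.arange(rows) for i in range(cols)})

df = get_wide_df(10000, 10)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_wide_df(10000, 100)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

df = get_wide_df(10000, 1000)
print(df.shape)
%timeit df_r = df[df.columns[::-1]]

If the O(m*n) assumption holds, each tenfold increase in the column count should give roughly a tenfold increase in time once the fixed overhead stops dominating.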
The Big O complexity (as of Pandas 0.24) is O(m*n), where m is the number of rows and n is the number of columns. Note that this is when using the DataFrame.__getitem__ method (aka []) with an Index (see the relevant code, along with other key types that would trigger a copy).
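As a quick sanity check that this code path copies the underlying data rather than returning a view, a minimal sketch (the small all-int64 frame and its values are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
df_reversed = df[df.columns[::-1]]

# mutate the original; the reversed frame keeps its own copy of the data
df.loc[0, 'c'] = 99
print(df_reversed.loc[0, 'c'])  # still 5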
Here is a helpful stack trace:
<ipython-input-4-3162cae03863>(2)<module>()
1 columns = df.columns[::-1]
----> 2 df_reversed = df[columns]
pandas/core/frame.py(2682)__getitem__()
2681 # either boolean or fancy integer index
-> 2682 return self._getitem_array(key)
2683 elif isinstance(key, DataFrame):
pandas/core/frame.py(2727)_getitem_array()
2726 indexer = self.loc._convert_to_indexer(key, axis=1)
-> 2727 return self._take(indexer, axis=1)
2728
pandas/core/generic.py(2789)_take()
2788 axis=self._get_block_manager_axis(axis),
-> 2789 verify=True)
2790 result = self._constructor(new_data).__finalize__(self)
pandas/core/internals.py(4539)take()
4538 return self.reindex_indexer(new_axis=new_labels, indexer=indexer,
-> 4539 axis=axis, allow_dups=True)
4540
pandas/core/internals.py(4421)reindex_indexer()
4420 new_blocks = self._slice_take_blocks_ax0(indexer,
-> 4421 fill_tuple=(fill_value,))
4422 else:
pandas/core/internals.py(1254)take_nd()
1253 new_values = algos.take_nd(values, indexer, axis=axis,
-> 1254 allow_fill=False)
1255 else:
> pandas/core/algorithms.py(1658)take_nd()
1657 import ipdb; ipdb.set_trace()
-> 1658 func = _get_take_nd_function(arr.ndim, arr.dtype, out.dtype, axis=axis,
1659 mask_info=mask_info)
1660 func(arr, indexer, out, fill_value)
The func call on line 1660 in pandas/core/algorithms.py ultimately calls a cython function with O(m*n) complexity. This is where the data from the original frame is copied into out, which ends up holding a copy of the original data with the columns in reversed order.
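Conceptually, the work done by that take corresponds to something like the following NumPy-level sketch (a simplification for illustration only, not the actual pandas code; pandas stores a consolidated int64 block roughly as a (columns x rows) array, so reversing the DataFrame's columns is a take along the block's first axis):

import numpy as np

values = np.arange(12, dtype=np.int64).reshape(3, 4)  # one block: 3 columns x 4 rows
indexer = np.array([2, 1, 0])                         # reversed column order
out = np.empty_like(values)

# every element of the block is read once and written once into out,
# which is why the cost scales with rows * columns
for i, idx in enumerate(indexer):
    out[i] = values[idx]  # bulk copy of one block row (one DataFrame column)

The actual per-dtype pandas code is generated from the template below, which does the same thing.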
inner_take_2d_axis0_template = """\
    cdef:
        Py_ssize_t i, j, k, n, idx
        %(c_type_out)s fv

    n = len(indexer)
    k = values.shape[1]

    fv = fill_value

    IF %(can_copy)s:
        cdef:
            %(c_type_out)s *v
            %(c_type_out)s *o

        #GH3130
        if (values.strides[1] == out.strides[1] and
            values.strides[1] == sizeof(%(c_type_out)s) and
            sizeof(%(c_type_out)s) * n >= 256):

            for i from 0 <= i < n:
                idx = indexer[i]
                if idx == -1:
                    for j from 0 <= j < k:
                        out[i, j] = fv
                else:
                    v = &values[idx, 0]
                    o = &out[i, 0]
                    memmove(o, v, <size_t>(sizeof(%(c_type_out)s) * k))
            return

    for i from 0 <= i < n:
        idx = indexer[i]
        if idx == -1:
            for j from 0 <= j < k:
                out[i, j] = fv
        else:
            for j from 0 <= j < k:
                out[i, j] = %(preval)svalues[idx, j]%(postval)s
"""
Note that the above template function has a path that uses memmove (which is the path taken in this case, because we are mapping from int64 to int64 and the output has the same dimensions, since we are just swapping the indexes). Note that memmove is still O(n): it is proportional to the number of bytes it has to copy, although likely faster than writing to each index individually.
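As a minimal sketch of that point (illustrative only, using plain NumPy rather than the pandas internals): a bulk copy still scales with the number of bytes moved, so the fast path changes the constant factor, not the asymptotic complexity.

import numpy as np

small = np.arange(1000000, dtype=np.int64)    # ~8 MB
large = np.arange(10000000, dtype=np.int64)   # ~80 MB

%timeit small.copy()   # one bulk, memmove-style copy
%timeit large.copy()   # ~10x the bytes, so roughly ~10x the time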