
Apply function to each cell in DataFrame in place in pandas

Is it possible to apply function to each cell in a DataFrame in place in pandas?

I'm aware of pandas.DataFrame.applymap but it doesn't seem to allow in place application:

import numpy as np
import pandas as pd
np.random.seed(1)
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), 
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
format = lambda x: '%.2f' % x
frame = frame.applymap(format)
print(frame)

returns:

               b         d         e
Utah    1.624345 -0.611756 -0.528172
Ohio   -1.072969  0.865408 -2.301539
Texas   1.744812 -0.761207  0.319039
Oregon -0.249370  1.462108 -2.060141

            b      d      e
Utah     1.62  -0.61  -0.53
Ohio    -1.07   0.87  -2.30
Texas    1.74  -0.76   0.32
Oregon  -0.25   1.46  -2.06

frame = frame.applymap(format) will temporarily hold 2 copies of frame in memory, which I don't want.

I know one can apply a function to each cell in place with a NumPy array: Mapping a NumPy array in place.
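(For reference, the linked NumPy technique looks like the sketch below. Note it only applies when the result has the same dtype as the array, so the `'%.2f'` string formatting above would not qualify.)

```python
import numpy as np

a = np.random.randn(4, 3)

# nditer with 'readwrite' yields writable 0-d views into a's buffer,
# so assigning through x[...] mutates `a` itself -- no second array.
for x in np.nditer(a, op_flags=['readwrite']):
    x[...] = round(float(x), 2)
```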

asked Jul 06 '17 by Franck Dernoncourt


2 Answers

If it matters a lot to you, you can try writing your own Cython function.

Looking at the pandas source, the applymap function is defined as:

def applymap(self, func):
    # ...
    def infer(x):
        if x.empty:
            return lib.map_infer(x, func)
        return lib.map_infer(x.asobject, func)

    return self.apply(infer)

which shows that lib.map_infer is doing the work behind the scenes.

lib.map_infer is a Cython method; in its source you can clearly see it allocating space for a new result, result = np.empty(n, dtype=object), shown below:

def map_infer(ndarray arr, object f, bint convert=1):
    """
    Substitute for np.vectorize with pandas-friendly dtype inference
    Parameters
    ----------
    arr : ndarray
    f : function
    Returns
    -------
    mapped : ndarray
    """
    cdef:
        Py_ssize_t i, n
        ndarray[object] result
        object val

    n = len(arr)
    result = np.empty(n, dtype=object)
    for i in range(n):
        val = f(util.get_value_at(arr, i))

        # unbox 0-dim arrays, GH #690
        if is_array(val) and PyArray_NDIM(val) == 0:
            # is there a faster way to unbox?
            val = val.item()

        result[i] = val

    if convert:
        return maybe_convert_objects(result,
                                     try_float=0,
                                     convert_datetime=0,
                                     convert_timedelta=0)

    return result

... and this is the cliffhanger to my answer: perhaps the OP or someone else can modify the Cython method to create an in-place version that modifies the original array instead of creating a new result.

(I'm currently away from my coding computer, so I can't test anything.)
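A pure-Python sketch of what such an in-place variant might look like (the real change would need to go into the Cython source; `map_infer_inplace` is a hypothetical name, and this only works when `f`'s results fit the array's dtype):

```python
import numpy as np

def map_infer_inplace(arr, f):
    # Write each result back into the same slot instead of
    # allocating a new `result = np.empty(n, dtype=object)`.
    for i in range(len(arr)):
        arr[i] = f(arr[i])

arr = np.empty(3, dtype=object)
arr[:] = [1.1, 2.2, 3.3]
map_infer_inplace(arr, lambda x: x * 2)
```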

answered Sep 24 '22 by Alter


If my understanding is correct, pandas in-place operations work by calling an .update_inplace() method; for example, .replace() computes the new, replaced data first, then updates the dataframe accordingly.

.applymap() is a wrapper of .apply(); neither of these comes with an inplace option, but even if they did, they would still need to store all the output data in memory before modifying the dataframe.

From the source, .applymap() calls .apply(), which calls .aggregate(), which calls _aggregate(), which calls ._agg(), which is nothing more than a for loop run in Python (i.e. not Cython -- I think).
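That call chain means .applymap(func) behaves roughly like a column-wise .apply with Series.map (an equivalence sketch, not the actual implementation):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randn(3, 2), columns=['a', 'b'])

double = lambda x: x * 2
via_applymap = df.applymap(double)
via_apply = df.apply(lambda col: col.map(double))  # one column at a time

assert via_applymap.equals(via_apply)
```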

You could, of course, modify the underlying NumPy array directly; the following code rounds the dataframe in place:

frame = pd.DataFrame(np.random.randn(100, 100))

# first method: set one element at a time
# (indexing .values positionally works here because the frame has the
#  default RangeIndex, so labels coincide with positions)
for i in frame.index:
    for j in frame.columns:
        val = round(frame.values[i, j])
        frame.values[i, j] = val

# second method: build each new row in a buffer, then assign it in one go
newvals = np.zeros(frame.shape[1])
for i in frame.index:
    for j in frame.columns:
        val = round(frame.values[i, j])
        newvals[j] = val
    frame.values[i] = newvals

The first method sets one element at a time and takes about 1s; the second sets by row and takes 100ms; .applymap(round) does it in 20ms.

However, interestingly, if we use frame = pd.DataFrame(np.random.randn(1, 10000)), both the first method and .applymap(round) take about 1.2s, and the second takes about 100ms.

Finally, frame = pd.DataFrame(np.random.randn(10000,1)) has the first and second method taking 1s (unsurprisingly), and .applymap(round) takes 10ms.

These results more or less show that .applymap is essentially iterating over each column.
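The shape comparison can be reproduced with a quick timeit loop (absolute numbers are machine- and version-dependent; the shapes are the ones used above):

```python
import timeit
import numpy as np
import pandas as pd

# time .applymap(round) over three shapes with the same element count
for shape in [(10000, 1), (100, 100), (1, 10000)]:
    frame = pd.DataFrame(np.random.randn(*shape))
    t = timeit.timeit(lambda: frame.applymap(round), number=5)
    print(shape, '%.1f ms per call' % (t / 5 * 1000))
```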


This one caches a reference to the underlying NumPy array (for a single dtype, .values is a view, so the row writes still land in the dataframe):

newvals = np.zeros(frame.shape[1])
arr = frame.values
for i in frame.index:
    for j in frame.columns:
        val = round(arr[i, j])
        newvals[j] = val
    arr[i] = newvals

With a 100x100 dataframe, the former took about 300ms for me to run, and the latter 60ms -- the difference comes down to the repeated .values lookups on the dataframe!

Running the latter in Cython takes about 34ms, whereas .applymap(round) does it in 24ms; I have no idea why .applymap() is still faster here, though.

To answer the question: there probably isn't an in-place implementation of .applymap(); if there were, it would most likely involve storing all the 'applied' values before making the in-place change.

If you want to do an .applymap() in place, you can iterate over the underlying NumPy array, though at a cost in performance. The best approach is likely to iterate over rows or columns: assign arr = df.values[i], apply the function to each element of arr, then write it back with df.values[i] = arr, iterating over all i.
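Putting that together, a minimal helper might look like this sketch (`applymap_inplace` is a hypothetical name; it assumes a single dtype so that .values is a view, and that func's results fit that dtype -- under copy-on-write in newer pandas, writing through .values may no longer be supported):

```python
import numpy as np
import pandas as pd

def applymap_inplace(df, func):
    # With a homogeneous dtype, .values is a view onto the dataframe's
    # buffer, so writing into its rows modifies the dataframe itself.
    vals = df.values
    for i in range(vals.shape[0]):
        row = vals[i]              # a view of row i
        for j in range(row.shape[0]):
            row[j] = func(row[j])

np.random.seed(1)
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'))
applymap_inplace(frame, lambda x: round(x, 2))
```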

answered Sep 26 '22 by Ken Wei