Is it possible to apply function to each cell in a DataFrame in place in pandas?
I'm aware of pandas.DataFrame.applymap but it doesn't seem to allow in place application:
import numpy as np
import pandas as pd
np.random.seed(1)
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
format = lambda x: '%.2f' % x
frame = frame.applymap(format)
print(frame)
returns:
               b         d         e
Utah    1.624345 -0.611756 -0.528172
Ohio   -1.072969  0.865408 -2.301539
Texas   1.744812 -0.761207  0.319039
Oregon -0.249370  1.462108 -2.060141

            b      d      e
Utah     1.62  -0.61  -0.53
Ohio    -1.07   0.87  -2.30
Texas    1.74  -0.76   0.32
Oregon  -0.25   1.46  -2.06
frame = frame.applymap(format) will temporarily hold two copies of frame in memory, which I don't want.
I know one can apply a function to each cell in place with a NumPy array: Mapping a NumPy array in place.
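For context, the NumPy technique that linked answer describes looks roughly like this (a sketch, not pandas-specific):

```python
import numpy as np

arr = np.random.randn(4, 3)

# ufuncs can write straight back into the input via `out`,
# so no second array is allocated:
np.round(arr, 2, out=arr)

# for an arbitrary Python function, loop over a flat view
# (ravel() returns a view, not a copy, for contiguous arrays):
flat = arr.ravel()
for i in range(flat.size):
    flat[i] = flat[i] ** 2
```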
If it matters a lot to you, you can try writing your own Cython function.
I found the applymap function in pandas:

def applymap(self, func):
    # ...
    def infer(x):
        if x.empty:
            return lib.map_infer(x, func)
        return lib.map_infer(x.asobject, func)
    return self.apply(infer)
which shows that lib.map_infer is doing the work behind the scenes. lib.map_infer is a Cython method defined here; you can clearly see it allocating space for a new result (result = np.empty(n, dtype=object)) in the code below:
def map_infer(ndarray arr, object f, bint convert=1):
    """
    Substitute for np.vectorize with pandas-friendly dtype inference

    Parameters
    ----------
    arr : ndarray
    f : function

    Returns
    -------
    mapped : ndarray
    """
    cdef:
        Py_ssize_t i, n
        ndarray[object] result
        object val

    n = len(arr)
    result = np.empty(n, dtype=object)
    for i in range(n):
        val = f(util.get_value_at(arr, i))

        # unbox 0-dim arrays, GH #690
        if is_array(val) and PyArray_NDIM(val) == 0:
            # is there a faster way to unbox?
            val = val.item()

        result[i] = val

    if convert:
        return maybe_convert_objects(result,
                                     try_float=0,
                                     convert_datetime=0,
                                     convert_timedelta=0)

    return result
... and this is the cliffhanger to my answer. Perhaps the OP or someone else can modify this Cython method to create an in-place version that modifies the original array instead of creating a new result. I'm currently away from my coding computer, so I can't test anything :(
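I can't write the Cython here, but a pure-Python sketch of what such an in-place variant might do (hypothetical — map_infer_inplace is not a real pandas function) would be:

```python
import numpy as np

def map_infer_inplace(arr, f):
    """Hypothetical in-place counterpart of lib.map_infer: instead of
    allocating `result = np.empty(n, dtype=object)`, write f's output
    straight back into the input buffer."""
    flat = arr.ravel()          # a view into arr's own memory
    for i in range(flat.size):
        flat[i] = f(flat[i])
    return arr
```

The catch is that the dtype-inference step (maybe_convert_objects) is impossible in place: the question's format maps floats to strings, and strings can't be written back into a float64 buffer, so an in-place version could only support dtype-preserving functions.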
If my understanding is correct, pandas in-place operations involve calling an .update_inplace() method; for example, .replace() would compute the new, replaced data first, then update the dataframe accordingly.
.applymap() is a wrapper around .apply(); neither of these comes with an in-place option, but even if they did, they would still need to store all the output data in memory before modifying the dataframe.
From the source, .applymap() calls .apply(), which calls .aggregate(), which calls _aggregate(), which calls ._agg(), which is nothing more than a for loop run in Python (i.e. not Cython — I think).
You could, of course, modify the underlying NumPy array directly; the following code rounds the dataframe in place:

frame = pd.DataFrame(np.random.randn(100, 100))

for i in frame.index:
    for j in frame.columns:
        val = round(frame.values[i, j])
        frame.values[i, j] = val
newvals = np.zeros(frame.shape[1])
for i in frame.index:
    for j in frame.columns:
        val = round(frame.values[i, j])
        newvals[j] = val
    frame.values[i] = newvals
The first method sets one element at a time and takes about 1 s; the second sets by row and takes about 100 ms; .applymap(round) does it in 20 ms.
Interestingly, however, with frame = pd.DataFrame(np.random.randn(1, 10000)), both the first method and .applymap(round) take about 1.2 s, while the second takes about 100 ms.
Finally, with frame = pd.DataFrame(np.random.randn(10000, 1)), the first and second methods take 1 s (unsurprisingly) and .applymap(round) takes 10 ms.
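For reproducibility, the measurements above came from something along these lines (a sketch; absolute timings will vary by machine and pandas version):

```python
import timeit

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randn(100, 100))

def cell_by_cell():
    for i in frame.index:
        for j in frame.columns:
            frame.values[i, j] = round(frame.values[i, j])

def row_by_row():
    newvals = np.zeros(frame.shape[1])
    for i in frame.index:
        for j in frame.columns:
            newvals[j] = round(frame.values[i, j])
        frame.values[i] = newvals

# DataFrame.applymap was renamed to DataFrame.map in pandas 2.1
applymap = getattr(frame, "map", None) or frame.applymap

for fn in (cell_by_cell, row_by_row, lambda: applymap(round)):
    print(timeit.timeit(fn, number=1))
```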
These results show that .applymap() essentially iterates over each column: across the three shapes (10000, 1), (100, 100), and (1, 10000), the first was fastest and the third slowest. The following code does roughly the same thing as .applymap(), in place:
newvals = np.zeros(frame.shape[1])
for i in frame.index:
    for j in frame.columns:
        val = round(frame.values[i, j])
        newvals[j] = val
    frame.values[i] = newvals
This one grabs the underlying NumPy array once, outside the loop (for a single-dtype frame, .values is a view, so writing to it modifies the dataframe):

newvals = np.zeros(frame.shape[1])
arr = frame.values
for i in frame.index:
    for j in frame.columns:
        val = round(arr[i, j])
        newvals[j] = val
    arr[i] = newvals
With a 100x100 dataframe, the former took about 300 ms for me to run and the latter 60 ms — the difference is solely due to accessing .values through the dataframe on every iteration!
Running the latter in Cython takes about 34 ms, whereas .applymap(round) does it in 24 ms. I have no idea why .applymap() is still faster here, though.
To answer the question: there probably isn't an in-place implementation of .applymap(); if there were, it would most likely involve storing all the 'applied' values before making the in-place change.
If you want to do an .applymap() in place, you could just iterate over the underlying NumPy array, but this comes at a cost in performance. The best solution is likely to iterate over rows: assign arr = df.values[i], apply the function to each element of arr, modify the dataframe with df.values[i] = arr, and iterate over all i.
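Concretely, that recommendation might look like this (a sketch assuming a single-dtype frame, where df.values is a view into the underlying data rather than a copy; applymap_inplace is my name, not a pandas API):

```python
import numpy as np
import pandas as pd

def applymap_inplace(df, func):
    """Apply func to every cell of df in place, row by row.

    Only valid for a single-dtype frame: there df.values is a view,
    so writing into it modifies df; with mixed dtypes .values is a
    copy and df would not change at all.
    """
    values = df.values                 # fetch the array once, not per access
    for i in range(values.shape[0]):
        row = values[i]                # a view of one row
        for j in range(row.size):
            row[j] = func(row[j])

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'))
applymap_inplace(frame, lambda x: round(x, 2))
```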