Consider the following pd.DataFrame
import numpy as np
import pandas as pd
start_end = pd.DataFrame([[(0, 3), (4, 5), (6, 12)], [(7, 10), (11, 90), (91, 99)]])
values = np.random.rand(1, 99)
The start_end is a pd.DataFrame of shape (X, Y), where each cell holds a tuple of (start_location, end_location) into the values vector. Put another way, each cell references a slice of values, and those slices have different lengths.
Question
If I want to find the mean (for example) of the slice of values behind each cell of the pd.DataFrame, how can I do this in a cost-effective way?
I managed to achieve this with an .apply function, but it's quite slow.
I guess I need to find some way to represent the data as a numpy array and then map the result back to the 2D data-frame, but I can't figure out how.
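A minimal sketch of the slow per-cell approach I mean (just for illustration; cell_mean is only a helper name, and applymap stands in for the .apply I mentioned; it assumes the setup above):
# Slow baseline: one Python-level slice-and-reduce per cell
def cell_mean(bounds):
    start, stop = bounds
    return values.ravel()[start:stop].mean()

slow_result = start_end.applymap(cell_mean)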
Notes
The generalized problem
More generally speaking, I see this as a recurring problem: how to reduce a 3D structure, in which one of the dimensions is not of equal length, down to a 2D matrix via some transformation function (mean, min, etc.).
Looking at your sample data :
In [64]: start_end
Out[64]:
0 1 2
0 (1, 3) (4, 5) (6, 12)
1 (7, 10) (11, 12) (13, 19)
It is indeed non-overlapping for each row, but not across the entire dataset.
Now, we have np.ufunc.reduceat, which gives us the ufunc reduction for each slice, i.e. ufunc(ar[indices[i]: indices[i + 1]]), as long as indices[i] < indices[i+1].
So, with ufunc.reduceat(ar, indices), we would get :
[ufunc(ar[indices[0]: indices[1]]), ufunc(ar[indices[1]: indices[2]]), ..]
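As a tiny illustration of that behaviour (the numbers are made up just for this example):
import numpy as np

ar = np.arange(10)             # [0, 1, ..., 9]
indices = np.array([0, 3, 7])
# sums ar[0:3], ar[3:7] and ar[7:] respectively
np.add.reduceat(ar, indices)   # -> array([ 3, 18, 24])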
In our case, for each tuple (x,y), we know x<y. With the stacked version, we have :
[(x1,y1), (x2,y2), (x3,y3), ...]
If we flatten, it would be :
[x1,y1,x2,y2,x3,y3, ...]
So, we might not have y1<x2, and similarly for the pair y2,x3, but that's okay, because we don't need the ufunc reduction for those in-between slices; they can simply be skipped with step-sized slicing of the final output.
Thus, we would have :
# Inputs : a (1D array), start_end (2D array of shape (N,2))
lens = start_end[:,1]-start_end[:,0]
out = np.add.reduceat(a, start_end.ravel())[::2]/lens
The np.add.reduceat() part gives us the sliced summations. We need the division by lens for the average computation.
Sample run -
In [47]: a
Out[47]:
array([0.49264042, 0.00506412, 0.61419663, 0.77596769, 0.50721381,
0.76943416, 0.83570173, 0.2085408 , 0.38992344, 0.64348176,
0.3168665 , 0.78276451, 0.03779647, 0.33456905, 0.93971763,
0.49663649, 0.4060438 , 0.8711461 , 0.27630025, 0.17129342])
In [48]: start_end
Out[48]:
array([[ 1, 3],
[ 4, 5],
[ 6, 12],
[ 7, 10],
[11, 12],
[13, 19]])
In [49]: [np.mean(a[i:j]) for (i,j) in start_end]
Out[49]:
[0.30963037472653104,
0.5072138121177008,
0.5295464559328862,
0.41398199978967815,
0.7827645134019902,
0.5540688880441684]
In [50]: lens = start_end[:,1]-start_end[:,0]
...: out = np.add.reduceat(a, start_end.ravel())[::2]/lens
In [51]: out
Out[51]:
array([0.30963037, 0.50721381, 0.52954646, 0.413982 , 0.78276451,
0.55406889])
For completeness, referring back to the given sample, the conversion steps were :
# Given start_end as df and values as a 2D array
start_end = np.vstack(np.concatenate(start_end.values))
a = values.ravel()
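To map the flat out vector back onto the original 2D layout, a small sketch (start_end_df is a hypothetical name for the original DataFrame of tuples, kept around before it was overwritten by the stacked array; pandas is assumed imported as pd, as in the question):
# out was computed in row-major order, matching the DataFrame layout
mean_df = pd.DataFrame(out.reshape(start_end_df.shape),
                       index=start_end_df.index,
                       columns=start_end_df.columns)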
For other ufuncs that have a reduceat method, we would just replace np.add.reduceat with that ufunc's reduceat (and drop the division by lens, which is specific to the mean).
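For example, a per-slice maximum would be a sketch like this, reusing the same a and start_end arrays as above (np.maximum.reduceat is a standard ufunc method, just like np.add.reduceat):
# Per-slice maximum instead of the mean; no division by lens needed
max_out = np.maximum.reduceat(a, start_end.ravel())[::2]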
For computing the mean in your case, you will hardly beat precomputing cumulative sums first, using numpy.cumsum for instance. Check out the following code:
import numpy as np
import pandas as pd
import time
R = 1_000
C = 10_000
M = 100
# Generation of test case
start = np.random.randint(0, M-1, (R*C,1))
end = np.random.randint(0, M-1, (R*C,1))
# Ensure start < end for every pair (fix up the randomly generated bounds)
start = np.where(np.logical_and(start>=end, end>1), end-1, start)
end = np.where(np.logical_and(start>=end, start<M-1), start+1, end)
start_end = np.hstack((start, end))
values = np.random.rand(M)
t_start = time.time()
# Means via np.add.reduceat (for comparison)
lens = start_end[:,1]-start_end[:,0]
mean = np.add.reduceat(values, start_end.ravel())[::2]/lens
print('Timer 1:', time.time()-t_start, 's')
t_start = time.time()
# Cumulative sum (with a leading zero so that differences give slice sums)
cum_values = np.zeros((values.size+1,))
cum_values[1:] = np.cumsum(values)
# Means from the cumulative sums
mean_2 = (cum_values[start_end[:,1]]-cum_values[start_end[:,0]])/(start_end[:,1]-start_end[:,0])
print('Timer 2:', time.time()-t_start, 's')
print('Results are equal!' if np.allclose(mean, mean_2) else 'Results differ!')
print('Norm of the difference:', np.linalg.norm(mean - mean_2))
Output:
% python3 script.py
Timer 1: 0.48940515518188477 s
Timer 2: 0.16983389854431152 s
Results are equal!
Norm of the difference: 2.545241707481022e-12
The difference in performance gets even bigger when M increases. For M=5000 you get:
% python3 script.py
Timer 1: 4.5356669425964355 s
Timer 2: 0.1772768497467041 s
Results are equal!
Norm of the difference: 1.0660592585125616e-10
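To plug this back into the question's original format, here is a small sketch (slice_means is just an illustrative helper name; it assumes the DataFrame cells hold (start, end) tuples and values is the flat vector, as in the question):
import numpy as np
import pandas as pd

def slice_means(values, start_end_df):
    # One cumulative-sum pass, then a vectorized difference per (start, end) pair
    cum = np.zeros(values.size + 1)
    cum[1:] = np.cumsum(values.ravel())
    bounds = np.vstack(np.concatenate(start_end_df.values))    # (N, 2) int array
    means = (cum[bounds[:, 1]] - cum[bounds[:, 0]]) / (bounds[:, 1] - bounds[:, 0])
    return pd.DataFrame(means.reshape(start_end_df.shape),
                        index=start_end_df.index, columns=start_end_df.columns)
slice_means(values, start_end) then returns the per-cell means as a DataFrame with the same shape as start_end.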