 

How to apply an operation on a vector with offsets

Consider the following pd.DataFrame

import numpy as np
import pandas as pd

start_end = pd.DataFrame([[(0, 3), (4, 5), (6, 12)], [(7, 10), (11, 90), (91, 99)]])
values = np.random.rand(1, 99)

The start_end is a pd.DataFrame of shape (X, Y) where each cell holds a tuple of (start_location, end_location) into the values vector. Put another way, each cell references a slice of values, and those slices can have different lengths.

Question

If I want to find the mean (for example) of the slice of values referenced by each cell in the pd.DataFrame, how can I do this in a cost-effective way?

I managed to achieve this with an .apply function, but it's quite slow.

I guess I need to find some way to represent it as a numpy array and then map it back to the 2D data-frame, but I can't figure out how.
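
For reference, a minimal sketch of the kind of .apply-based baseline mentioned above (the original code isn't shown in the question, so the exact form here is an assumption):

vals = values.ravel()
# Apply a Python-level function to every (start, end) cell -- simple but slow
slow_mean = start_end.applymap(lambda se: vals[se[0]:se[1]].mean())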

Notes

  • The distance between start and end can vary, and outliers can exist.
  • The start/end range of a cell never overlaps with the ranges of other cells (it would be interesting to see whether this prerequisite affects the speed of a solution).

The generalized problem

More generally speaking, I see this as a recurring problem of how to reduce a 3D array, where one of the dimensions is not of equal length, down to a 2D matrix via some transformation function (mean, min, etc.).

asked Jul 07 '20 by Newskooler




2 Answers

Prospective approach

Looking at your sample data:

In [64]: start_end
Out[64]: 
         0         1         2
0   (1, 6)    (4, 5)   (6, 12)
1  (7, 10)  (11, 12)  (13, 19)

It is indeed non-overlapping for each row, but not across the entire dataset.

Now, we have np.ufunc.reduceat, which gives us a ufunc reduction for each slice:

ufunc(ar[indices[i]: indices[i + 1]])

as long as indices[i] < indices[i+1].

So, with ufunc.reduceat(ar, indices), we would get:

[ufunc(ar[indices[0]: indices[1]]), ufunc(ar[indices[1]: indices[2]]), ..]

In our case, for each tuple (x, y), we know x < y. With the stacked version, we have:

[(x1,y1), (x2,y2), (x3,y3), ...]

If we flatten, it would be:

[x1,y1,x2,y2,x3,y3, ...]

So, we might not have y1 < x2, but that's okay, because we don't need the ufunc reduction for that pair, and similarly for the pair y2, x3; those unwanted reductions are simply skipped with a step-size slicing of the final output, as the small illustration below shows.
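
A tiny illustration of that skipping, just to make the [::2] trick concrete (not part of the original answer):

import numpy as np

ar = np.arange(8)              # [0, 1, 2, 3, 4, 5, 6, 7]
idx = np.array([0, 4, 1, 5])   # flattened pairs (0, 4) and (1, 5)
np.add.reduceat(ar, idx)       # -> array([ 6,  4, 10, 18]); positions 1 and 3 are unwanted
np.add.reduceat(ar, idx)[::2]  # -> array([ 6, 10]) == [sum(ar[0:4]), sum(ar[1:5])]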

Thus, we would have:

# Inputs : a (1D array), start_end (2D array of shape (N,2))
lens = start_end[:,1]-start_end[:,0]
out = np.add.reduceat(a, start_end.ravel())[::2]/lens

The np.add.reduceat() part gives us the sliced summations; we then divide by lens to get the averages.

Sample run -

In [47]: a
Out[47]: 
array([0.49264042, 0.00506412, 0.61419663, 0.77596769, 0.50721381,
       0.76943416, 0.83570173, 0.2085408 , 0.38992344, 0.64348176,
       0.3168665 , 0.78276451, 0.03779647, 0.33456905, 0.93971763,
       0.49663649, 0.4060438 , 0.8711461 , 0.27630025, 0.17129342])

In [48]: start_end
Out[48]: 
array([[ 1,  3],
       [ 4,  5],
       [ 6, 12],
       [ 7, 10],
       [11, 12],
       [13, 19]])

In [49]: [np.mean(a[i:j]) for (i,j) in start_end]
Out[49]: 
[0.30963037472653104,
 0.5072138121177008,
 0.5295464559328862,
 0.41398199978967815,
 0.7827645134019902,
 0.5540688880441684]

In [50]: lens = start_end[:,1]-start_end[:,0]
    ...: out = np.add.reduceat(a, start_end.ravel())[::2]/lens

In [51]: out
Out[51]: 
array([0.30963037, 0.50721381, 0.52954646, 0.413982  , 0.78276451,
       0.55406889])

For completeness, referring back to the given sample, the conversion steps were:

# Given start_end as df and values as a 2D array
start_end = np.vstack(np.concatenate(start_end.values)) 
a = values.ravel()  
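
If the per-cell means are wanted back in the original (X, Y) layout (as the question asks for), one possible last step is the sketch below; df here stands for the original start_end DataFrame, saved before the conversion above overwrites the name:

mean_df = pd.DataFrame(out.reshape(df.shape), index=df.index, columns=df.columns)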

For other ufuncs that have a reduceat method, we would simply replace np.add.reduceat with that ufunc's reduceat.
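
For instance, a per-slice maximum would be a sketch along these lines (no division by lens is needed in that case):

out_max = np.maximum.reduceat(a, start_end.ravel())[::2]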

answered Oct 24 '22 by Divakar


For computing the mean in your case, you'll never go faster than precomputing cumulative sums first, for instance with numpy.cumsum. Check out the following code:

import numpy as np
import pandas as pd
import time

R = 1_000
C = 10_000
M = 100

# Generation of test case
start = np.random.randint(0, M-1, (R*C,1))
end = np.random.randint(0, M-1, (R*C,1))
start = np.where(np.logical_and(start>=end, end>1), end-1, start)
end = np.where(np.logical_and(start>=end, start<M-1), start+1, end)
start_end = np.hstack((start, end))

values = np.random.rand(M)

t_start = time.time()
# Method 1: mean of each slice via np.add.reduceat
lens = start_end[:,1]-start_end[:,0]
mean = np.add.reduceat(values, start_end.ravel())[::2]/lens
print('Timer 1:', time.time()-t_start, 's')

t_start = time.time()
# Method 2: precompute cumulative sums
cum_values = np.zeros((values.size+1,))
cum_values[1:] = np.cumsum(values)
# Mean of each slice as a difference of cumulative sums
mean_2 = (cum_values[start_end[:,1]]-cum_values[start_end[:,0]])/(start_end[:,1]-start_end[:,0])
print('Timer 2:', time.time()-t_start, 's')

print('Results are equal!' if np.allclose(mean, mean_2) else 'Results differ!')
print('Norm of the difference:', np.linalg.norm(mean - mean_2))

Output:

% python3 script.py
Timer 1: 0.48940515518188477 s
Timer 2: 0.16983389854431152 s
Results are equal!
Norm of the difference: 2.545241707481022e-12

The performance gap widens even further as M increases, since the reduceat approach re-sums every slice (its cost grows with the slice lengths, which scale with M), while the cumulative-sum lookup costs a constant amount per slice. For M=5000 you get:

% python3 script.py
Timer 1: 4.5356669425964355 s
Timer 2: 0.1772768497467041 s
Results are equal!
Norm of the difference: 1.0660592585125616e-10
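
Tied back to the question's original objects, the cumulative-sum idea might look like the sketch below (assuming start_end is the DataFrame of tuples and values is the (1, 99) array from the question):

se = np.vstack(np.concatenate(start_end.values))   # (N, 2) integer array of start/end pairs
vals = values.ravel()
cum_values = np.concatenate(([0.0], np.cumsum(vals)))
means = (cum_values[se[:, 1]] - cum_values[se[:, 0]]) / (se[:, 1] - se[:, 0])
mean_df = pd.DataFrame(means.reshape(start_end.shape))   # back to the (X, Y) layout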
answered Oct 24 '22 by bousof