Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing/removing loop

I have the following piece of code that I would like to optimize using numpy, preferably removing the loop. I can't see how to approach it, so any suggesting would be helpful.

indices is a (N,2) numpy array of integers, N can be a few millions. What the code does is finding the repeated indices in the first column. For these indices I make all the combinations of two of the corresponding indices in the second column. Then I collect them together with the index in the first column.

index_sets = []
uniques, counts = np.unique(indices[:,0], return_counts=True)
potentials = uniques[counts > 1]
for p in potentials:
    correspondents = indices[(indices[:,0] == p),1]
    combs = np.vstack(list(combinations(correspondents, 2)))
    combs = np.hstack((np.tile(p, (combs.shape[0], 1)), combs))
    index_sets.append(combs)
like image 256
martinako Avatar asked Oct 19 '22 06:10

martinako


1 Answers

Few improvements could be suggested :

  • Initialize output array, for which we can pre-calculate the estimated number of rows needed for storing combinations corresponding to each group. We know that with N elements, the total number of possible combinations would be N*(N-1)/2, to give us the combination lengths for each group. Furthermore, the total number of rows in the output array would be sum of all those interval lengths.

  • Pre-calcuate as many stuffs as possible in a vectorized manner before going into a loop.

  • Use a loop to get the combinations, which because of the ragged pattern can't be vectorized. Use np.repeat to simulate tiling and do it before loop to give us the first element for each group and thus the first column of the output array.

So, with all those improvements in mind, an implementation would look like this -

# Remove rows with counts == 1 
_,idx, counts = np.unique(indices[:,0], return_index=True, return_counts=True)
indices = np.delete(indices,idx[counts==1],axis=0)

# Decide the starting indices of corresponding to start of new groups 
# charaterized by new elements along the sorted first column
start_idx = np.unique(indices[:,0], return_index=True)[1]
all_idx = np.append(start_idx,indices.shape[0])

# Get interval lengths that are required to store pairwise combinations
# of each group for unique ID from column-0
interval_lens = np.array([item*(item-1)/2 for item in np.diff(all_idx)])

# Setup output array and set the first column as a repeated array
out = np.zeros((interval_lens.sum(),3),dtype=int)
out[:,0] = np.repeat(indices[start_idx,0],interval_lens)

# Decide the start-stop indices for storing into output array 
ssidx = np.append(0,np.cumsum(interval_lens))

# Finally run a loop gto store all the combinations into initialized o/p array
for i in range(idx.size):
    out[ssidx[i]:ssidx[i+1],1:] = \
    np.vstack(combinations(indices[all_idx[i]:all_idx[i+1],1],2))

Please note that the output array would be a big (M, 3) shaped array and not split into list of arrays as produced by the original code. If still needed as such, one can use np.split for the same.

Also, quick runtime tests suggest that there isn't much improvement with the proposed code. So, probably bulk of the runtime is spent in getting the combinations. Thus, it seems alternative approach with networkx that is specially suited for such connection based problems might be a better fit.

like image 163
Divakar Avatar answered Oct 22 '22 11:10

Divakar