NumPy: How to avoid this loop?

Is there a way to avoid this loop in order to optimize the code?

import numpy as np

cLoss = 0
dist_ = np.array([0,1,0,1,1,0,0,1,1,0]) # just an example, longer in reality
TLabels = np.array([-1,1,1,1,1,-1,-1,1,-1,-1]) # just an example, longer in reality
t = float(dist_.size)
for i in range(len(dist_)):
    labels = TLabels[dist_ == dist_[i]]
    cLoss+= 1 - TLabels[i]*(1. * np.sum(labels)/t)
print(cLoss)

Note: dist_ and TLabels are both numpy arrays with the same shape (t,1)

farhawa asked Jun 07 '15 10:06


2 Answers

I am not sure exactly what you want to do, but are you aware of scipy.ndimage.measurements for computing on arrays with labels? It looks like you want something like:

import scipy.ndimage

cLoss = len(dist_) - sum(TLabels * scipy.ndimage.measurements.sum(TLabels, dist_, dist_) / len(dist_))
Thomas Baruchel answered Oct 11 '22 03:10


My first question is: what is labels at each step in the loop?

With dist_ = array([2,1,2]) and TLabels = array([-1,1,1])

I get

[-1  1]
[1]
[-1  1]

The different lengths immediately raise a warning flag: it may be difficult to vectorize this.

With the longer arrays in the edited example

[-1  1 -1 -1 -1]
[ 1  1  1  1 -1]
[-1  1 -1 -1 -1]
[ 1  1  1  1 -1]
[ 1  1  1  1 -1]
[-1  1 -1 -1 -1]
[-1  1 -1 -1 -1]
[ 1  1  1  1 -1]
[ 1  1  1  1 -1]
[-1  1 -1 -1 -1]

The labels vectors are all the same length. Is that normal, or just a coincidence of values?

Drop a couple of elements off the end of dist_ (and TLabels), and the labels are:

In [375]: for i in range(len(dist_)):
        labels = TLabels[dist_ == dist_[i]]
        v = (1.*np.sum(labels)/t); v1 = 1-TLabels[i]*v
        print(labels, v, TLabels[i], v1)
        cLoss += v1
   .....:     
(array([-1,  1, -1, -1]), -0.25, -1, 0.75)
(array([1, 1, 1, 1]), 0.5, 1, 0.5)
(array([-1,  1, -1, -1]), -0.25, 1, 1.25)
(array([1, 1, 1, 1]), 0.5, 1, 0.5)
(array([1, 1, 1, 1]), 0.5, 1, 0.5)
(array([-1,  1, -1, -1]), -0.25, -1, 0.75)
(array([-1,  1, -1, -1]), -0.25, -1, 0.75)
(array([1, 1, 1, 1]), 0.5, 1, 0.5)

Again the lengths of labels depend on the data, but there are really only a few distinct calculations: there is one v value for each distinct dist_ value.

Without working out all the details, it looks like you are just calculating labels*labels for each distinct dist_ value, and then summing those.
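As a minimal sketch of that observation, assuming 1-D arrays (call .ravel() first if yours have shape (t,1)): each element contributes 1 - TLabels[i] * S / t, where S is the sum of TLabels over its dist_ group, so the total reduces to n minus the sum of the squared group sums divided by t. Using np.unique with return_inverse and np.bincount here is my own choice, not something taken from the question, and the names group_idx and group_sums are just illustrative.

```python
import numpy as np

# Example data from the question
dist_ = np.array([0, 1, 0, 1, 1, 0, 0, 1, 1, 0])
TLabels = np.array([-1, 1, 1, 1, 1, -1, -1, 1, -1, -1])
t = float(dist_.size)

# Reference: the original loop
loop_loss = 0.0
for i in range(len(dist_)):
    labels = TLabels[dist_ == dist_[i]]
    loop_loss += 1 - TLabels[i] * (np.sum(labels) / t)

# Map every dist_ value to a group index (0, 1, ...), one per distinct value
_, group_idx = np.unique(dist_, return_inverse=True)

# One sum of TLabels per distinct dist_ value
group_sums = np.bincount(group_idx, weights=TLabels)

# Broadcast each group's sum back to its members and apply the loss formula
vec_loss = np.sum(1 - TLabels * group_sums[group_idx] / t)

# Equivalent closed form: n - sum(S_g ** 2) / t
closed_loss = dist_.size - np.sum(group_sums ** 2) / t

print(loop_loss, vec_loss, closed_loss)
```

For the example data all three values agree (about 8.2). The loop does one boolean mask per element, while the bincount version does a single pass after the np.unique call.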

This looks like a group-by problem. You want to divide dist_ into groups with a common value, and sum some function of their corresponding TLabels values. Python's itertools has a groupby function, and so does pandas. itertools.groupby only merges consecutive equal items, so it needs dist_ to be sorted first; pandas groupby does not require sorting.

Try sorting dist_ and see if that adds any clarity to the problem.
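A rough sketch of the sort-then-group idea, using itertools.groupby from the standard library and the example arrays from the question. It assumes 1-D inputs; the variable names (order, sorted_dist, sorted_labels) are mine. This still loops in Python, but only once per distinct dist_ value rather than once per element.

```python
import numpy as np
from itertools import groupby

# Example data from the question
dist_ = np.array([0, 1, 0, 1, 1, 0, 0, 1, 1, 0])
TLabels = np.array([-1, 1, 1, 1, 1, -1, -1, 1, -1, -1])
t = float(dist_.size)

# Sort both arrays by dist_ so that equal values become adjacent
order = np.argsort(dist_, kind="stable")
sorted_dist = dist_[order]
sorted_labels = TLabels[order]

# groupby only merges consecutive equal keys, hence the sort above
cLoss = 0.0
for value, pairs in groupby(zip(sorted_dist, sorted_labels), key=lambda p: p[0]):
    group = np.array([label for _, label in pairs])
    s = group.sum()
    # Each member of the group contributes 1 - label * s / t
    cLoss += np.sum(1 - group * s / t)

print(cLoss)
```

After sorting, each group is a contiguous run, which makes it easier to see that only one sum per distinct dist_ value is needed.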

hpaulj answered Oct 11 '22 01:10