I have a numpy object array containing several lists of index numbers:
>>> idxLsts = np.array([[1], [0, 2]], dtype=object)
I define a vectorized function to append a value to each list:
>>> idx = 99
>>> f = np.vectorize(lambda idxLst: idxLst.append(idx))
I invoke the function. I don't care about the return value, just the side effect.
>>> f(idxLsts)
array([None, None], dtype=object)
The index 99 was added twice to the first list. Why? I'm stumped.
>>> idxLsts
array([[1, 99, 99], [0, 2, 99]], dtype=object)
With other values of idxLsts, it doesn't happen:
>>> idxLsts = np.array([[1, 2], [0, 2, 4]], dtype=object)
>>> f(idxLsts)
array([None, None], dtype=object)
>>> idxLsts
array([[1, 2, 99], [0, 2, 4, 99]], dtype=object)
My suspicion is it's related to the documentation which says: "Define a vectorized function which takes a nested sequence of objects or numpy arrays as inputs and returns a numpy array as output. The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy."
From the vectorize docstring:
The data type of the output of `vectorized` is determined by calling the function with the first element of the input. This can be avoided by specifying the `otypes` argument.
And from the code:
theout = self.thefunc(*newargs)
This is an extra call to thefunc, used to determine the output type. This is why the first element is getting two 99s appended.
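To see the extra call directly, here is a minimal sketch that logs every argument the wrapped function receives. The names seen and record are made up for illustration, and the count assumes the default settings (no otypes, cache left at its default of False).

```python
import numpy as np

seen = []  # hypothetical log of every argument the function is called with

def record(x):
    seen.append(x)
    return None

# Default settings: no otypes, so vectorize has to infer the output type itself.
np.vectorize(record)(np.array([10, 20, 30]))

# Three elements, but one more call than that: the first element is seen twice.
print(len(seen))
print(seen)
```

The first entry in the log comes from the type-detection call, and the rest come from the actual pass over the array.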
This behavior happens in your second case as well:
import numpy as np
idxLsts = np.array([[1, 2], [0, 2, 4]], dtype=object)
idx = 99
# no otypes given, so the lambda is called one extra time on the first list
f = np.vectorize(lambda x: x.append(idx))
f(idxLsts)
print(idxLsts)
yields
[[1, 2, 99, 99] [0, 2, 4, 99]]
You could use np.frompyfunc instead of np.vectorize:
import numpy as np
idxLsts = np.array([[1, 2], [0, 2, 4]], dtype=object)
idx = 99
# frompyfunc(func, nin, nout): one input, one output, always returns object arrays
f = np.frompyfunc(lambda x: x.append(idx), 1, 1)
f(idxLsts)
print(idxLsts)
yields
[[1, 2, 99] [0, 2, 4, 99]]
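Another option, following the docstring quoted above, is to keep np.vectorize and pass otypes so the output type does not have to be detected. This is a sketch using the lists from the question. otypes=[object] is my choice here because append returns None.

```python
import numpy as np

idxLsts = np.array([[1], [0, 2]], dtype=object)
idx = 99

# With otypes given up front, vectorize does not need the extra
# type-detection call on the first element.
f = np.vectorize(lambda idxLst: idxLst.append(idx), otypes=[object])
f(idxLsts)

print(idxLsts.tolist())  # [[1, 99], [0, 2, 99]]
```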