Element-wise broadcasting for comparing two NumPy arrays?

Tags: python, arrays, vectorization, numpy, array-broadcasting

Let's say I have an array like this:

import numpy as np

base_array = np.array([-13, -9, -11, -3, -3, -4,   2,  2,
                         2,  5,   7,  7,  8,  7,  12, 11])

Suppose I want to know: "how many elements in base_array are greater than 4?" This can be done simply by exploiting broadcasting:

np.sum(4 < base_array)

For which the answer is 7. Now, suppose instead of comparing to a single value, I want to do this over an array. In other words, for each value c in the comparison_array, find out how many elements of base_array are greater than c. If I do this the naive way, it fails because the two shapes, (26,) and (16,), cannot be broadcast together:

comparison_array = np.arange(-13, 13)
comparison_result = np.sum(comparison_array < base_array)

Output:

Traceback (most recent call last):
  File "<pyshell#87>", line 1, in <module>
    np.sum(comparison_array < base_array)
ValueError: operands could not be broadcast together with shapes (26,) (16,)

If I could somehow have each element of comparison_array get broadcast to base_array's shape, that would solve this. But I don't know how to do such an "element-wise broadcasting".

Now, I do know how to implement this for both cases using list comprehensions:

first = sum([4 < i for i in base_array])
second = [sum([c < i for i in base_array])
          for c in comparison_array]
print(first)
print(second)

Output:

7
[15, 15, 14, 14, 13, 13, 13, 13, 13, 12, 10, 10, 10, 10, 10, 7, 7, 7, 6, 6, 3, 2, 2, 2, 1, 0]

But as we all know, this will be orders of magnitude slower than a correctly-vectorized numpy implementation on larger arrays. So, how should I do this in numpy so that it's fast? Ideally this solution should extend to any kind of operation where broadcasting works, not just greater-than or less-than in this example.

asked Aug 06 '18 16:08

dain

2 Answers

You can simply add a dimension to the comparison array, so that the comparison is "stretched" across all values along the new dimension.

>>> np.sum(comparison_array[:, None] < base_array)
228

This is the fundamental principle of broadcasting, and it works for any element-wise operation, not just comparisons.

If you need the sum done along an axis, you just specify the axis along which you want to sum after the comparison.

>>> np.sum(comparison_array[:, None] < base_array, axis=1)
array([15, 15, 14, 14, 13, 13, 13, 13, 13, 12, 10, 10, 10, 10, 10,  7,  7,
        7,  6,  6,  3,  2,  2,  2,  1,  0])
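As a rough sketch of how this generalizes, the snippet below checks the broadcast result against the question's list comprehension and then reuses the same pattern with a different operation. The names mask, counts, expected, abs_diff and counts_outer are illustrative only and do not come from the question or the answer; the .outer method is shown as one possible alternative, not as the approach the answer recommends.

```python
import numpy as np

base_array = np.array([-13, -9, -11, -3, -3, -4, 2, 2,
                       2, 5, 7, 7, 8, 7, 12, 11])
comparison_array = np.arange(-13, 13)

# Shape (26, 1) against shape (16,) broadcasts to (26, 16):
# row i holds comparison_array[i] compared with every element of base_array.
mask = comparison_array[:, None] < base_array
counts = mask.sum(axis=1)

# Reference result from the nested list comprehension in the question.
expected = [int(sum(c < i for i in base_array)) for c in comparison_array]
print(counts.tolist() == expected)

# The same pattern with another operation: pairwise absolute differences.
abs_diff = np.abs(comparison_array[:, None] - base_array)
print(abs_diff.shape)

# Binary ufuncs such as np.less also provide .outer, which builds the same
# pairwise table without the explicit [:, None].
counts_outer = np.less.outer(comparison_array, base_array).sum(axis=1)
print(np.array_equal(counts, counts_outer))
```

Note that the intermediate (26, 16) table is materialized in memory, so for very large inputs the pairwise approach may need to be applied in chunks.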

answered Oct 20 '22 01:10

miradulo

You will want to transpose one of the arrays for broadcasting to work correctly. When you broadcast two arrays together, the dimensions are lined up and any unit dimensions are effectively expanded to the non-unit size that they match. So two arrays of size (16, 1) (the original array) and (1, 26) (the comparison array) would broadcast to (16, 26).

Don't forget to sum across the dimension of size 16, which is axis 0 of the (16, 26) result:

(base_array[:, None] > comparison_array).sum(axis=0)

None in an index is equivalent to np.newaxis: it's one of many ways to insert a new unit dimension at the specified position. The reason that you don't need to do comparison_array[None, :] is that broadcasting lines up the trailing (rightmost) dimensions and automatically pads any missing leading dimensions with ones.
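A quick shape check can help confirm which axis to reduce over. The following is only a sketch; the variable names (col_none, col_newaxis, col_reshape, table, counts, explicit) are made up for illustration.

```python
import numpy as np

base_array = np.array([-13, -9, -11, -3, -3, -4, 2, 2,
                       2, 5, 7, 7, 8, 7, 12, 11])
comparison_array = np.arange(-13, 13)

# Three equivalent ways to turn the (16,) array into a (16, 1) column.
col_none = base_array[:, None]
col_newaxis = base_array[:, np.newaxis]
col_reshape = base_array.reshape(-1, 1)
print(col_none.shape, col_newaxis.shape, col_reshape.shape)

# (16, 1) against (26,) is treated as (16, 1) against (1, 26),
# so the comparison produces a (16, 26) boolean table.
table = col_none > comparison_array
print(table.shape)

# Reducing over axis 0 (length 16) leaves one count per comparison value.
counts = table.sum(axis=0)
print(counts.shape)

# Writing the leading unit dimension explicitly gives the same result.
explicit = (col_none > comparison_array[None, :]).sum(axis=0)
print(np.array_equal(counts, explicit))
```

Printing the shape of the intermediate table before reducing is a cheap way to catch an off-by-one axis choice.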

answered Oct 20 '22 01:10

Mad Physicist
