Suppose I have the following array of arrays:
Input = np.array([[[[17.63, 0. , -0.71, 29.03],
[17.63, -0.09, 0.71, 56.12],
[ 0.17, 1.24, -2.04, 18.49],
[ 1.41, -0.8 , 0.51, 11.85],
[ 0.61, -0.29, 0.15, 36.75]]],
[[[ 0.32, -0.14, 0.39, 24.52],
[ 0.18, 0.25, -0.38, 18.08],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0.43, 0. , 0.3 , 0. ]]],
[[[ 0.75, -0.38, 0.65, 19.51],
[ 0.37, 0.27, 0.52, 24.27],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]]]])
Input.shape
(3, 1, 5, 4)
Together with this Input array there is a corresponding Label array, one label per sub-array, so that:
Label = np.array([0, 1, 2])
Label.shape
(3,)
I need some way to check all nested arrays of Input and keep ONLY the arrays with sufficient data points. By this I mean I want to eliminate (or delete) every array whose entries in the last 3 rows are all zeros, and at the same time eliminate the corresponding Label for that array.
Expected output:
Input_filtered
array([[[[17.63, 0. , -0.71, 29.03],
[17.63, -0.09, 0.71, 56.12],
[ 0.17, 1.24, -2.04, 18.49],
[ 1.41, -0.8 , 0.51, 11.85],
[ 0.61, -0.29, 0.15, 36.75]]],
[[[ 0.32, -0.14, 0.39, 24.52],
[ 0.18, 0.25, -0.38, 18.08],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0.43, 0. , 0.3 , 0. ]]]])
Label_filtered
array([0, 1])
What's the trick that I need?
In NumPy, you filter an array using a boolean index list: a list of booleans, one per index of the array. If the value at an index is True, that element is included in the filtered array; if it is False, that element is excluded.
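For illustration, a minimal sketch of boolean-mask indexing with made-up values:

import numpy as np

arr = np.array([10, 20, 30, 40])
mask = [True, False, True, False]   # one boolean per index

print(arr[mask])   # [10 30] -- only positions where the mask is True are kept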
You should be able to do this with vectorized numpy commands only.
filter_ = np.any(Input[:, :, -3:], axis=(1, 2, 3))  # True where any entry in the last 3 rows is non-zero
labels_filtered = Label[filter_]
inputs_filtered = Input[filter_]
For the example set you provided this yields 4.95 µs ± 9.69 ns per loop (100000 loops each), compared to the solution of anon01 with 17.1 µs ± 111 ns per loop (100000 loops each). The improvement should be even more noticeable on larger arrays.
If your data has a different dimension you can change the axis argument. For an arbitrary number of axis it could look like the following:
filter_ = np.any(Input[:, :, -3:], axis=tuple(range(1, Input.ndim)))
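As a quick sanity check on the example arrays from the question (a sketch; it assumes Input and Label are defined exactly as above):

filter_ = np.any(Input[:, :, -3:], axis=(1, 2, 3))

print(filter_)               # [ True  True False]
print(Input[filter_].shape)  # (2, 1, 5, 4)
print(Label[filter_])        # [0 1]

The third sub-array is dropped because its last three rows are all zeros, which matches the expected Input_filtered and Label_filtered.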
The best way to do this depends on the scale of your data. If there are few sub-arrays (thousands or less) you can generate a filter list that is applied to the Label and Input arrays:
filter = []
for j in range(len(Input)):
    arr = Input[j, :, -3:]          # last 3 rows of sub-array j
    filter.append(np.any(arr))      # True if any entry is non-zero
Label_filtered = Label[filter]
Input_filtered = Input[filter]
A few things to note: the vectorized/numpy bits (Input[j,:,-3:], np.any(arr)) are very fast, while the native Python iteration and list usage (for j in range, filter.append) are very slow.
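If you want to compare the two approaches on your own data, a rough timing sketch with timeit could look like this (the function names are made up for illustration, and absolute numbers will depend on your machine and array sizes):

import timeit
import numpy as np

def loop_version():
    mask = [np.any(Input[j, :, -3:]) for j in range(len(Input))]
    return Input[mask], Label[mask]

def vectorized_version():
    mask = np.any(Input[:, :, -3:], axis=(1, 2, 3))
    return Input[mask], Label[mask]

print(timeit.timeit(loop_version, number=10_000))        # pure-Python loop over sub-arrays
print(timeit.timeit(vectorized_version, number=10_000))  # single vectorized reduction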