Let's say I have an arbitrary array np.array([1, 2, 3, 4, 5, 6]) and another array that maps each element to a group, np.array(['a', 'b', 'a', 'c', 'c', 'b']). I now want to separate them into three different arrays depending on the label/group given in the second array, so that a, b, c = np.array([1, 3]), np.array([2, 6]), np.array([4, 5]). Is a simple for loop the way to go, or is there some efficient method I'm missing here?
When you write efficient, I assume that what you want here is actually fast.
I will also briefly discuss asymptotic efficiency.
In this context, we refer to N as the input size and K as the number of unique values.
My proposed solution is to use a combination of np.argsort() and a custom-built groupby_np() specifically optimized for NumPy inputs:
import numpy as np


def groupby_np(arr, both=True):
    n = len(arr)
    # Indices where the (sorted) array changes value, i.e. group boundaries.
    extrema = np.nonzero(arr[:-1] != arr[1:])[0] + 1
    if both:
        last_i = 0
        for i in extrema:
            yield last_i, i
            last_i = i
        yield last_i, n
    else:
        yield 0
        yield from extrema
        yield n


def labeling_groupby_np(values, labels):
    slicing = labels.argsort()
    sorted_labels = labels[slicing]
    sorted_values = values[slicing]
    del slicing
    result = {}
    for i, j in groupby_np(sorted_labels, True):
        result[sorted_labels[i]] = sorted_values[i:j]
    return result
This has complexity O(N log N + K).
The N log N comes from the sorting step and the K comes from the last loop.
The interesting part is that both steps are fast: the N-dependent part is executed at low level, and the K-dependent part does only O(1) work (taking a view) per group.
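On the question's input, the sort-then-slice mechanics above can be sketched as a condensed, standalone snippet (using kind='stable' so that ties keep their original order; the functions above use the default sort):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])
labels = np.array(['a', 'b', 'a', 'c', 'c', 'b'])

# Sort both arrays by label so that equal labels become contiguous.
order = labels.argsort(kind='stable')
sorted_labels = labels[order]  # ['a' 'a' 'b' 'b' 'c' 'c']
sorted_values = values[order]  # [1 3 2 6 4 5]

# Group boundaries: indices where the sorted label changes.
bounds = np.nonzero(sorted_labels[:-1] != sorted_labels[1:])[0] + 1
starts = np.concatenate(([0], bounds))
stops = np.concatenate((bounds, [len(sorted_labels)]))

result = {sorted_labels[i]: sorted_values[i:j] for i, j in zip(starts, stops)}
print(result)  # {'a': array([1, 3]), 'b': array([2, 6]), 'c': array([4, 5])}
```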
A solution like the following (very similar to @theEpsilon's answer):
import numpy as np


def labeling_loop(values, labels):
    labeled = {}
    for x, l in zip(values, labels):
        if l not in labeled:
            labeled[l] = [x]
        else:
            labeled[l].append(x)
    return {k: np.array(v) for k, v in labeled.items()}
uses two loops and has complexity O(N + K). I do not think you can easily avoid the second loop (without a significant speed penalty). As for the first loop, it is executed in Python, which carries a significant speed penalty of its own.
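As a side note, the same Python-level loop can be written slightly more compactly with collections.defaultdict; this is just a sketch of a minor variant with the same asymptotic complexity, not one of the benchmarked candidates:

```python
from collections import defaultdict

import numpy as np


def labeling_loop_dd(values, labels):
    # defaultdict removes the need for the explicit membership check.
    labeled = defaultdict(list)
    for x, l in zip(values, labels):
        labeled[l].append(x)
    return {k: np.array(v) for k, v in labeled.items()}


values = np.array([1, 2, 3, 4, 5, 6])
labels = np.array(['a', 'b', 'a', 'c', 'c', 'b'])
print(labeling_loop_dd(values, labels))
# {'a': array([1, 3]), 'b': array([2, 6]), 'c': array([4, 5])}
```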
Another possibility is to use np.unique(), which brings the main loop to a lower level. However, this brings other challenges: once the unique values are extracted, there is no efficient way of constructing the arrays you want without some NumPy advanced indexing, which is O(N) per unique value. The overall complexity of these solutions is then O(K * N), but because the NumPy advanced indexing is done at a lower level, this can lead to a relatively fast solution, although with worse asymptotic complexity than the alternatives.
Possible implementations include (similar to @AjayVerma's and @AKX's answers):
import numpy as np


def labeling_unique_bool(values, labels):
    return {l: values[l == labels] for l in np.unique(labels)}
import numpy as np


def labeling_unique_nonzero(values, labels):
    return {l: values[np.nonzero(l == labels)] for l in np.unique(labels)}
Additionally, one could consider a pre-sorting step to speed up the slicing by avoiding NumPy advanced indexing. However, the sorting step can be more costly than the advanced indexing, and in general the proposed approach tends to be faster for the inputs I tested.
import numpy as np


def labeling_unique_argsort(values, labels):
    uniques, counts = np.unique(labels, return_counts=True)
    sorted_values = values[labels.argsort()]
    bound = 0
    result = {}
    for x, c in zip(uniques, counts):
        result[x] = sorted_values[bound:bound + c]
        bound += c
    return result
Another approach, which is neat in principle (same as my proposed approach), but slow in practice would be to use sorting and itertools.groupby():
import itertools
from operator import itemgetter

import numpy as np


def labeling_groupby(values, labels):
    slicing = labels.argsort()
    sorted_labels = labels[slicing]
    sorted_values = values[slicing]
    del slicing
    result = {}
    for x, g in itertools.groupby(zip(sorted_labels, sorted_values), itemgetter(0)):
        result[x] = np.fromiter(map(itemgetter(1), g), dtype=sorted_values.dtype)
    return result
Finally, a Pandas-based approach, which is quite concise and reasonably fast for larger inputs, but under-performing for smaller ones (similar to @Ehsan's answer). Note that the column must be selected with ['values'], since attribute access clashes with the DataFrame's own .values attribute:
import pandas as pd


def labeling_groupby_pd(values, labels):
    df = pd.DataFrame({'values': values, 'labels': labels})
    return df.groupby('labels')['values'].apply(lambda x: x.values).to_dict()
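Applied to the question's arrays, the Pandas route looks like this (a standalone sketch; bracket column selection is used because the column name 'values' clashes with the identically named DataFrame attribute):

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6])
labels = np.array(['a', 'b', 'a', 'c', 'c', 'b'])

df = pd.DataFrame({'values': values, 'labels': labels})
# groupby() sorts the group keys by default, so the dict comes out a, b, c.
out = df.groupby('labels')['values'].apply(lambda s: s.to_numpy()).to_dict()
print(out)  # {'a': array([1, 3]), 'b': array([2, 6]), 'c': array([4, 5])}
```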
Now, talking is cheap, so let us attach some numbers to fast and slow and produce some plots for varying input sizes. The value of K is capped at 52 (lower and upper case letters of the English alphabet). When N is much larger than K, the probability of reaching the capping value is very high.
Input is generated programmatically with the following:
import string

import numpy as np


def gen_input(n, p, labels=string.ascii_letters):
    k = len(labels)
    values = np.arange(n)
    # p scales down the pool of labels actually drawn from, capping K.
    labels = np.array([labels[i] for i in np.random.randint(0, int(k * p), n)])
    return values, labels
and the benchmarks are produced for values of p from (1.0, 0.5, 0.1, 0.05), which change the maximum value of K. The plots below refer to the p values in that order.
p=1.0 (at most K = 52): [timing plot]
...and zoomed on the fastest: [timing plot]
p=0.5 (at most K = 26): [timing plot]
p=0.1 (at most K = 5): [timing plot]
p=0.05 (at most K = 2): [timing plot]
...and zoomed on the fastest: [timing plot]
One can see that, except for very small inputs, the proposed method outperforms the other methods for the tested inputs.
(Full benchmarks available here).
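For a quick local check, a minimal timing sketch along these lines can be put together with timeit (re-stating gen_input and one of the candidates so the snippet is standalone; absolute numbers depend on the machine and NumPy version):

```python
import string
import timeit

import numpy as np


def gen_input(n, p, labels=string.ascii_letters):
    k = len(labels)
    values = np.arange(n)
    labels = np.array([labels[i] for i in np.random.randint(0, int(k * p), n)])
    return values, labels


def labeling_unique_bool(values, labels):
    return {l: values[l == labels] for l in np.unique(labels)}


values, labels = gen_input(10_000, 1.0)
t = timeit.timeit(lambda: labeling_unique_bool(values, labels), number=10)
print(f"labeling_unique_bool: {t / 10 * 1e3:.3f} ms per call")
```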
One may also consider moving some parts of the looping to Numba / Cython, but I'd leave this to the interested reader.
You can use numpy.unique:
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array(['a', 'b', 'a', 'c', 'c', 'b'])
print({value: x[y == value] for value in np.unique(y)})
Output
{'a': array([1, 3]), 'b': array([2, 6]), 'c': array([4, 5])}