Is there a function in NumPy to group the array below by its first column?
I couldn't find a good answer on the internet.
>>> a
array([[  1, 275],
       [  1, 441],
       [  1, 494],
       [  1, 593],
       [  2, 679],
       [  2, 533],
       [  2, 686],
       [  3, 559],
       [  3, 219],
       [  3, 455],
       [  4, 605],
       [  4, 468],
       [  4, 692],
       [  4, 613]])
Wanted output:
array([[[275, 441, 494, 593]],
       [[679, 533, 686]],
       [[559, 219, 455]],
       [[605, 468, 692, 613]]], dtype=object)
Inspired by Eelco Hoogendoorn's library, but without using his library, and exploiting the fact that the first column of your array is always increasing (if not, sort first with a = a[a[:, 0].argsort()]):
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]
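To see what the one-liner does, here are the intermediate values for the example array (first_idx is just an illustrative name, not part of the answer): np.unique with return_index=True gives the position of the first occurrence of each key, and dropping the leading 0 leaves exactly the interior cut points np.split needs.
>>> first_idx = np.unique(a[:, 0], return_index=True)[1]  # first occurrence of each key
>>> first_idx
array([ 0,  4,  7, 10])
>>> first_idx[1:]  # drop the leading 0; np.split only needs interior cut points
array([ 4,  7, 10])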
I didn't "timeit" ([EDIT] see below) but this is probably the faster way to achieve the question :
[EDIT Sept 2021] I ran timeit on my MacBook M1, on a table of 10k random integers. Each duration is for 1000 calls.
>>> a = np.random.randint(5, size=(10000, 2)) # 5 different "groups"
# Only the sort
>>> a = a[a[:, 0].argsort()]
⏱ 116.9 ms
# Group by on the already sorted table
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
⏱ 35.5 ms
# Total sort + groupby
>>> a = a[a[:, 0].argsort()]
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
⏱ 153.0 ms 👑
# With numpy-indexed package (cf Eelco answer)
>>> npi.group_by(a[:, 0]).split(a[:, 1])
⏱ 353.3 ms
# With pandas (cf Piotr answer)
>>> df = pd.DataFrame(a, columns=["key", "val"]) # no timer for this line
>>> df.groupby("key").val.apply(pd.Series.tolist)
⏱ 362.3 ms
# With defaultdict, the python native way (cf Piotr answer)
>>> d = defaultdict(list)
>>> for key, val in a:
...     d[key].append(val)
⏱ 3543.2 ms
# With numpy_groupies (cf Michael answer)
>>> aggregate(a[:, 0], a[:, 1], "array", fill_value=[])
⏱ 376.4 ms
Second timeit scenario, with 500 different groups instead of 5. I'm surprised about pandas: I ran it several times, but it just behaves badly in this scenario.
>>> a = np.random.randint(500, size=(10000, 2))
just the sort     141.1 ms
already_sorted    392.0 ms
sort+groupby      542.4 ms
pandas           2695.8 ms
numpy-indexed     800.6 ms
defaultdict      3707.3 ms
numpy_groupies    836.7 ms
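For reference, numbers like these can be reproduced with a small timeit harness along the following lines (a sketch, not the exact script used; the statement string matches the group-by above):
>>> import timeit
>>> import numpy as np
>>> a = np.random.randint(500, size=(10000, 2))
>>> a = a[a[:, 0].argsort()]  # pre-sort so only the group-by itself is timed
>>> stmt = "np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])"
>>> timeit.timeit(stmt, globals=globals(), number=1000)  # seconds for 1000 calls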
[EDIT] I improved the answer thanks to ns63sr's answer and Behzad Shayegh's comment. Thanks also to TMBailey for noting that the complexity of argsort is O(n log n).
The numpy_indexed package (disclaimer: I am its author) aims to fill this gap in numpy. All operations in numpy-indexed are fully vectorized, and no O(n^2) algorithms were harmed during the making of this library.
import numpy_indexed as npi
npi.group_by(a[:, 0]).split(a[:, 1])
Note that it is usually more efficient to directly compute relevant properties over such groups (i.e., group_by(keys).mean(values)) rather than first splitting into a list / jagged array.
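As a minimal sketch of that direct-reduction style on the question's array (assuming the (unique keys, reduced values) pair that numpy_indexed reductions return):
>>> import numpy_indexed as npi
>>> keys, means = npi.group_by(a[:, 0]).mean(a[:, 1])
>>> keys   # the group labels: 1, 2, 3, 4
>>> means  # one mean per group, computed without building ragged lists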
NumPy is not very handy here, because the desired output is not an array of integers (it is an array of list objects).
I suggest either the pure Python way...
from collections import defaultdict
%%timeit
d = defaultdict(list)
for key, val in a:
    d[key].append(val)
10.7 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# result:
defaultdict(list,
            {1: [275, 441, 494, 593],
             2: [679, 533, 686],
             3: [559, 219, 455],
             4: [605, 468, 692, 613]})
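If you specifically want the object array from the question, one way to build it from the dict (a sketch; the explicit loop avoids NumPy trying to broadcast the ragged rows) is:
>>> import numpy as np
>>> out = np.empty(len(d), dtype=object)
>>> for i, val in enumerate(d.values()):
...     out[i] = np.array(val)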
...or the pandas way:
import pandas as pd
%%timeit
df = pd.DataFrame(a, columns=["key", "val"])
df.groupby("key").val.apply(pd.Series.tolist)
979 µs ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# result:
key
1    [275, 441, 494, 593]
2         [679, 533, 686]
3         [559, 219, 455]
4    [605, 468, 692, 613]
Name: val, dtype: object
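The same result can be spelled with the list built-in as the aggregator, and converted to a plain object array if needed:
>>> df.groupby("key")["val"].agg(list)             # same Series of lists
>>> df.groupby("key")["val"].agg(list).to_numpy()  # object array of lists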