Is there a function in NumPy to group the array below by its first column?
I couldn't find a good answer on the internet.
>>> a
array([[  1, 275],
       [  1, 441],
       [  1, 494],
       [  1, 593],
       [  2, 679],
       [  2, 533],
       [  2, 686],
       [  3, 559],
       [  3, 219],
       [  3, 455],
       [  4, 605],
       [  4, 468],
       [  4, 692],
       [  4, 613]])
Wanted output:
array([[[275, 441, 494, 593]],
       [[679, 533, 686]],
       [[559, 219, 455]],
       [[605, 468, 692, 613]]], dtype=object)
Inspired by Eelco Hoogendoorn's library, but without using his library, and exploiting the fact that the first column of your array is always increasing (if not, sort first with a = a[a[:, 0].argsort()]):
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
[array([275, 441, 494, 593]),
 array([679, 533, 686]),
 array([559, 219, 455]),
 array([605, 468, 692, 613])]
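To see what the one-liner does, here are the intermediate values for the example array (first_idx is just an illustrative name, not part of the answer): np.unique with return_index=True gives the position of the first occurrence of each key, and dropping the leading 0 leaves exactly the interior cut points np.split needs.
>>> first_idx = np.unique(a[:, 0], return_index=True)[1]  # first occurrence of each key
>>> first_idx
array([ 0,  4,  7, 10])
>>> first_idx[1:]  # drop the leading 0; np.split only needs interior cut points
array([ 4,  7, 10])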
I didn't "timeit" ([EDIT] see below) but this is probably the faster way to achieve the question :
[EDIT Sept 2021] I ran timeit on my MacBook M1, on a table of 10k random integers. Each duration is for 1000 calls.
>>> a = np.random.randint(5, size=(10000, 2)) # 5 different "groups"
# Only the sort
>>> a = a[a[:, 0].argsort()]
⏱ 116.9 ms
# Group by on the already sorted table
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
⏱ 35.5 ms
# Total sort + groupby
>>> a = a[a[:, 0].argsort()]
>>> np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])
⏱ 153.0 ms 👑
# With numpy-indexed package (cf Eelco answer)
>>> npi.group_by(a[:, 0]).split(a[:, 1])
⏱ 353.3 ms
# With pandas (cf Piotr answer)
>>> df = pd.DataFrame(a, columns=["key", "val"]) # no timer for this line
>>> df.groupby("key").val.apply(pd.Series.tolist)
⏱ 362.3 ms
# With defaultdict, the python native way (cf Piotr answer)
>>> d = defaultdict(list)
>>> for key, val in a:
...     d[key].append(val)
⏱ 3543.2 ms
# With numpy_groupies (cf Michael answer)
>>> aggregate(a[:, 0], a[:, 1], "array", fill_value=[])
⏱ 376.4 ms
Second timeit scenario, with 500 different groups instead of 5. I'm surprised about pandas: I ran it several times, but it just behaves badly in this scenario.
>>> a = np.random.randint(500, size=(10000, 2))
just the sort     141.1 ms
already_sorted    392.0 ms
sort+groupby      542.4 ms
pandas           2695.8 ms
numpy-indexed     800.6 ms
defaultdict      3707.3 ms
numpy_groupies    836.7 ms
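For reference, numbers like these can be reproduced with a small timeit harness along the following lines (a sketch, not the exact script used; the statement string matches the group-by above):
>>> import timeit
>>> import numpy as np
>>> a = np.random.randint(500, size=(10000, 2))
>>> a = a[a[:, 0].argsort()]  # pre-sort so only the group-by itself is timed
>>> stmt = "np.split(a[:, 1], np.unique(a[:, 0], return_index=True)[1][1:])"
>>> timeit.timeit(stmt, globals=globals(), number=1000)  # seconds for 1000 calls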
[EDIT] I improved the answer thanks to ns63sr's answer and Behzad Shayegh's comment. Thanks also to TMBailey for noting that the complexity of argsort is O(n log n).
The numpy_indexed package (disclaimer: I am its author) aims to fill this gap in numpy. All operations in numpy-indexed are fully vectorized, and no O(n^2) algorithms were harmed during the making of this library.
import numpy_indexed as npi
npi.group_by(a[:, 0]).split(a[:, 1])
Note that it is usually more efficient to directly compute relevant properties over such groups (i.e., group_by(keys).mean(values)) rather than first splitting into a list / jagged array.
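As a minimal sketch of that direct-reduction style on the question's array (assuming the (unique keys, reduced values) pair that numpy_indexed reductions return):
>>> import numpy_indexed as npi
>>> keys, means = npi.group_by(a[:, 0]).mean(a[:, 1])
>>> keys   # the group labels: 1, 2, 3, 4
>>> means  # one mean per group, computed without building ragged lists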
NumPy is not very handy here, because the desired output is not an array of integers (it is an array of list objects).
I suggest either the pure Python way...
from collections import defaultdict
%%timeit
d = defaultdict(list)
for key, val in a:
    d[key].append(val)
10.7 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# result:
defaultdict(list,
            {1: [275, 441, 494, 593],
             2: [679, 533, 686],
             3: [559, 219, 455],
             4: [605, 468, 692, 613]})
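If you specifically want the object array from the question, one way to build it from the dict (a sketch; the explicit loop avoids NumPy trying to broadcast the ragged rows) is:
>>> import numpy as np
>>> out = np.empty(len(d), dtype=object)
>>> for i, val in enumerate(d.values()):
...     out[i] = np.array(val)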
...or the pandas way:
import pandas as pd
%%timeit
df = pd.DataFrame(a, columns=["key", "val"])
df.groupby("key").val.apply(pd.Series.tolist)
979 µs ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# result:
key
1    [275, 441, 494, 593]
2         [679, 533, 686]
3         [559, 219, 455]
4    [605, 468, 692, 613]
Name: val, dtype: object
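The same result can be spelled with the list built-in as the aggregator, and converted to a plain object array if needed:
>>> df.groupby("key")["val"].agg(list)             # same Series of lists
>>> df.groupby("key")["val"].agg(list).to_numpy()  # object array of lists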