I have a numpy array like this: [1 2 2 0 0 1 3 5]
Is it possible to get the indices of the elements as a 2D array? For instance, the answer for the above input would be [[3 4], [0 5], [1 2], [6], [], [7]]
Currently I have to loop over the different values and call numpy.where(input == i) for each value, which has terrible performance with a large enough input.
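For reference, a minimal sketch of that naive approach (variable names are mine):
import numpy as np
a = np.array([1, 2, 2, 0, 0, 1, 3, 5])
# One np.where pass per distinct value: roughly O(max(a) * len(a)) overall.
result = [np.where(a == i)[0].tolist() for i in range(a.max() + 1)]
# [[3, 4], [0, 5], [1, 2], [6], [], [7]]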
Here is an O(max(x)+len(x)) approach using scipy.sparse:
import numpy as np
from scipy import sparse
x = np.array("1 2 2 0 0 1 3 5".split(),int)
x
# array([1, 2, 2, 0, 0, 1, 3, 5])
M,N = x.max()+1,x.size
sparse.csc_matrix((x,x,np.arange(N+1)),(M,N)).tolil().rows.tolist()
# [[3, 4], [0, 5], [1, 2], [6], [], [7]]
This works by creating a sparse matrix with entries at positions (x[0],0), (x[1],1), ... Using the CSC (compressed sparse column) format this is rather simple. The matrix is then converted to LIL (linked list) format. This format stores the column indices for each row as a list in its rows attribute, so all we need to do is take that and convert it to a list.
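To see the construction concretely, here is a small illustration of my own (using ones as the stored values, which sidesteps any handling of explicit zeros) that prints the dense form of the intermediate matrix; row i has entries exactly in the columns j where x[j] == i:
import numpy as np
from scipy import sparse
x = np.array([1, 2, 2, 0, 0, 1, 3, 5])
M, N = x.max() + 1, x.size
m = sparse.csc_matrix((np.ones(N, int), x, np.arange(N + 1)), (M, N))
print(m.toarray())
# [[0 0 0 1 1 0 0 0]   <- row 0: value 0 occurs at indices 3 and 4
#  [1 0 0 0 0 1 0 0]   <- row 1: value 1 occurs at indices 0 and 5
#  [0 1 1 0 0 0 0 0]
#  [0 0 0 0 0 0 1 0]
#  [0 0 0 0 0 0 0 0]   <- row 4: value 4 never occurs
#  [0 0 0 0 0 0 0 1]]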
Note that for small arrays argsort-based solutions are probably faster, but at some not insanely large size this will cross over.
EDIT: argsort-based, numpy-only solution:
np.split(x.argsort(kind="stable"),np.bincount(x)[:-1].cumsum())
# [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]
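To unpack the one-liner, here are the intermediate values (the names are mine, using the same x as above):
order = x.argsort(kind="stable")             # indices of x grouped by value: array([3, 4, 0, 5, 1, 2, 6, 7])
split_points = np.bincount(x)[:-1].cumsum()  # boundaries between groups: array([2, 4, 6, 7, 7])
np.split(order, split_points)                # same result as the one-liner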
If the order of indices within groups doesn't matter you can also try argpartition (it happens to make no difference in this small example, but this is not guaranteed in general):
bb = np.bincount(x)[:-1].cumsum()
np.split(x.argpartition(bb),bb)
# [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]
EDIT: @Divakar recommends against the use of np.split. Instead, a loop is probably faster:
A = x.argsort(kind="stable")
# Shifting x by 1 prepends a zero count, so B[i-1]:B[i] brackets the group for value i-1.
B = np.bincount(x+1).cumsum()
[A[B[i-1]:B[i]] for i in range(1,len(B))]
Or you could use the brand-new (Python 3.8+) walrus operator:
A = x.argsort(kind="stable")
B = np.bincount(x)
L = 0
# The (L := L+b) walrus advances the cursor; .tolist() makes each b a plain Python int.
[A[L:(L:=L+b)] for b in B.tolist()]
EDIT (not pure numpy): As an alternative to numba (see @senderle's post) we can also use pythran. Compile with pythran -O3 <filename.py>:
import numpy as np

#pythran export sort_to_bins(int[:],int)

def sort_to_bins(idx, mx):
    if mx==-1:
        mx = idx.max() + 1
    # Counting sort: count occurrences, offset by 2 so the same buffer
    # can later serve as write cursors and, finally, as group boundaries.
    cnts = np.zeros(mx + 2, int)
    for i in range(idx.size):
        cnts[idx[i] + 2] += 1
    # Prefix-sum to turn counts into group start offsets.
    for i in range(3, cnts.size):
        cnts[i] += cnts[i-1]
    # Scatter each position into its group; slot idx[i]+1 is the write cursor.
    res = np.empty_like(idx)
    for i in range(idx.size):
        res[cnts[idx[i]+1]] = i
        cnts[idx[i]+1] += 1
    return [res[cnts[i]:cnts[i+1]] for i in range(mx)]
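After compiling, usage would look something like this (a sketch; I'm assuming the file was saved as sort_to_bins.py, so the compiled module is also named sort_to_bins):
import numpy as np
from sort_to_bins import sort_to_bins  # assumed module name, from the hypothetical filename
x = np.array([1, 2, 2, 0, 0, 1, 3, 5])
print(sort_to_bins(x, -1))
# [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]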
Here numba wins by a whisker performance-wise:
repeat(lambda:enum_bins_numba_buffer(x),number=10)
# [0.6235917090671137, 0.6071486569708213, 0.6096088469494134]
repeat(lambda:sort_to_bins(x,-1),number=10)
# [0.6235359431011602, 0.6264424560358748, 0.6217901279451326]
Older stuff:
import numpy as np

#pythran export bincollect(int[:])
def bincollect(a):
    o = [[] for _ in range(a.max()+1)]
    for i,j in enumerate(a):
        o[j].append(i)
    return o
Timings vs. numba (old)
timeit(lambda:bincollect(x),number=10)
# 3.5732191529823467
timeit(lambda:enumerate_bins(x),number=10)
# 6.7462647299980745
Although the request is for a numpy solution, I decided to see whether there is an interesting numba-based solution. And indeed there is! Here's an approach that represents the partitioned list as a ragged array stored in a single preallocated buffer. This takes some inspiration from the argsort approach proposed by Paul Panzer. (For an older version that didn't do as well, but was simpler, see below.)
import numba
import numpy as np

@numba.jit(numba.void(numba.int64[:],
                      numba.int64[:],
                      numba.int64[:]),
           nopython=True)
def enum_bins_numba_buffer_inner(ints, bins, starts):
    for x in range(len(ints)):
        i = ints[x]
        bins[starts[i]] = x
        starts[i] += 1

@numba.jit(nopython=False)  # Not 100% sure this does anything...
def enum_bins_numba_buffer(ints):
    ends = np.bincount(ints).cumsum()
    starts = np.empty(ends.shape, dtype=np.int64)
    starts[1:] = ends[:-1]
    starts[0] = 0

    bins = np.empty(ints.shape, dtype=np.int64)
    enum_bins_numba_buffer_inner(ints, bins, starts)

    starts[1:] = ends[:-1]
    starts[0] = 0
    return [bins[s:e] for s, e in zip(starts, ends)]
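A quick sanity check on the example array from the question (assuming a platform where the default integer dtype is int64, matching the signatures above):
x = np.array([1, 2, 2, 0, 0, 1, 3, 5])
print(enum_bins_numba_buffer(x))
# [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]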
This processes a ten-million-item array in 75 ms, which is nearly a 50x speedup over a list-based version written in pure Python.
For a slower but somewhat more readable version, here's what I had before, based on recently added experimental support for dynamically sized "typed lists," which allow us to fill up each bin in an out-of-order way much more quickly.
This wrestles with numba's type inference engine a bit, and I'm sure there's a better way to handle that part. This also turns out to be almost 10x slower than the above.
@numba.jit(nopython=True)
def enum_bins_numba(ints):
    bins = numba.typed.List()
    for i in range(ints.max() + 1):
        inner = numba.typed.List()
        inner.append(0)  # An awkward way of forcing type inference.
        inner.pop()
        bins.append(inner)

    for x, i in enumerate(ints):
        bins[i].append(x)

    return bins
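This returns numba typed lists rather than plain Python lists; converting them for display is straightforward (a small sketch of my own):
x = np.array([1, 2, 2, 0, 0, 1, 3, 5])
print([list(b) for b in enum_bins_numba(x)])
# [[3, 4], [0, 5], [1, 2], [6], [], [7]]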
I tested these against the following:
from collections import defaultdict

import numpy as np
from scipy import sparse

def enum_bins_dict(ints):
    enum_bins = defaultdict(list)
    for k, v in enumerate(ints):
        enum_bins[v].append(k)
    return enum_bins

def enum_bins_list(ints):
    enum_bins = [[] for i in range(ints.max() + 1)]
    for x, i in enumerate(ints):
        enum_bins[i].append(x)
    return enum_bins

def enum_bins_sparse(ints):
    M, N = ints.max() + 1, ints.size
    return sparse.csc_matrix((ints, ints, np.arange(N + 1)),
                             (M, N)).tolil().rows.tolist()
I also tested them against a precompiled cython version similar to enum_bins_numba_buffer (described in detail below). On an array of ten million random ints (ints = np.random.randint(0, 100, 10000000)) I get the following results:
enum_bins_dict(ints)
3.71 s ± 80.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
enum_bins_list(ints)
3.28 s ± 52.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
enum_bins_sparse(ints)
1.02 s ± 34.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
enum_bins_numba(ints)
693 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
enum_bins_cython(ints)
82.3 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
enum_bins_numba_buffer(ints)
77.4 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Impressively, this way of working with numba outperforms a cython version of the same function, even with bounds-checking turned off. I don't yet have enough familiarity with pythran to test this approach using it, but I'd be interested to see a comparison. It seems likely based on this speedup that the pythran version might also be quite a bit faster with this approach.
Here's the cython version for reference, with some build instructions. Once you have cython installed, you'll need a simple setup.py file like this:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

ext_modules = [
    Extension(
        'enum_bins_cython',
        ['enum_bins_cython.pyx'],
    )
]

setup(
    ext_modules=cythonize(ext_modules),
    include_dirs=[numpy.get_include()]
)
And the cython module, enum_bins_cython.pyx:
# cython: language_level=3

import cython
import numpy
cimport numpy

@cython.boundscheck(False)
@cython.cdivision(True)
@cython.wraparound(False)
cdef void enum_bins_inner(long[:] ints, long[:] bins, long[:] starts) nogil:
    cdef long i, x
    for x in range(len(ints)):
        i = ints[x]
        bins[starts[i]] = x
        starts[i] = starts[i] + 1

def enum_bins_cython(ints):
    assert (ints >= 0).all()

    # There might be a way to avoid storing two offset arrays and
    # save memory, but `enum_bins_inner` modifies the input, and
    # having separate lists of starts and ends is convenient for
    # the final partition stage.
    ends = numpy.bincount(ints).cumsum()
    starts = numpy.empty(ends.shape, dtype=numpy.int64)
    starts[1:] = ends[:-1]
    starts[0] = 0

    bins = numpy.empty(ints.shape, dtype=numpy.int64)
    enum_bins_inner(ints, bins, starts)

    starts[1:] = ends[:-1]
    starts[0] = 0
    return [bins[s:e] for s, e in zip(starts, ends)]
With these two files in your working directory, run this command:
python setup.py build_ext --inplace
You can then import the function using from enum_bins_cython import enum_bins_cython.