
Numpy: get the index of the elements of a 1d array as a 2d array

I have a numpy array like this: [1 2 2 0 0 1 3 5]

Is it possible to get the indices of the elements as a 2d array? For instance, the answer for the above input would be [[3 4], [0 5], [1 2], [6], [], [7]]

Currently I loop over the distinct values and call numpy.where(input == i) for each one, which has terrible performance once the input gets large.
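
For reference, this is roughly what I'm doing now (a minimal sketch; the variable names are illustrative):

import numpy as np

a = np.array([1, 2, 2, 0, 0, 1, 3, 5])

# One numpy.where call per distinct value -- slow for large inputs.
result = [np.where(a == i)[0].tolist() for i in range(a.max() + 1)]
print(result)  # [[3, 4], [0, 5], [1, 2], [6], [], [7]]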

asked Oct 20 '19 by Frederico Schardong


2 Answers

Here is an O(max(x)+len(x)) approach using scipy.sparse:

import numpy as np
from scipy import sparse

x = np.array("1 2 2 0 0 1 3 5".split(),int)
x
# array([1, 2, 2, 0, 0, 1, 3, 5])


M,N = x.max()+1,x.size
sparse.csc_matrix((x,x,np.arange(N+1)),(M,N)).tolil().rows.tolist()
# [[3, 4], [0, 5], [1, 2], [6], [], [7]]

This works by creating a sparse matrix with entries at positions (x[0], 0), (x[1], 1), ... Using the CSC (compressed sparse column) format this is rather simple, because the matrix can be built directly from its (data, indices, indptr) triple. The matrix is then converted to LIL (linked list) format, which stores the column indices of each row as a list in its rows attribute, so all we need to do is take that attribute and convert it to a list.
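
To make that concrete, here is a small inspection sketch on the example array (the values in the comments are what I'd expect scipy to produce for this input):

import numpy as np
from scipy import sparse

x = np.array([1, 2, 2, 0, 0, 1, 3, 5])
M, N = x.max() + 1, x.size

# Column j holds a single stored entry at row x[j].
A = sparse.csc_matrix((x, x, np.arange(N + 1)), (M, N))
print(A.indices.tolist())  # row of each entry:    [1, 2, 2, 0, 0, 1, 3, 5]
print(A.indptr.tolist())   # one entry per column: [0, 1, 2, 3, 4, 5, 6, 7, 8]

# LIL keeps, for every row, the column indices of its entries.
print(A.tolil().rows.tolist())  # [[3, 4], [0, 5], [1, 2], [6], [], [7]]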

Note that for small arrays argsort-based solutions are probably faster, but at some not-insanely-large size this approach crosses over.

EDIT:

argsort-based numpy-only solution:

np.split(x.argsort(kind="stable"),np.bincount(x)[:-1].cumsum())
# [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]

If the order of indices within groups doesn't matter you can also try argpartition (it happens to make no difference in this small example but this is not guaranteed in general):

bb = np.bincount(x)[:-1].cumsum()
np.split(x.argpartition(bb),bb)
# [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]

EDIT:

@Divakar recommends against the use of np.split. Instead, a loop is probably faster:

A = x.argsort(kind="stable")
B = np.bincount(x+1).cumsum()
[A[B[i-1]:B[i]] for i in range(1,len(B))]

Or you could use the brand-new (Python 3.8+) walrus operator:

A = x.argsort(kind="stable")
B = np.bincount(x)
L = 0
[A[L:(L:=L+b)] for b in B.tolist()]

EDIT(EDITED):

(Not pure numpy): As an alternative to numba (see @senderle's post) we can also use pythran.

Compile with pythran -O3 <filename.py>

import numpy as np

#pythran export sort_to_bins(int[:],int)

def sort_to_bins(idx, mx):
    if mx==-1: 
        mx = idx.max() + 1
    cnts = np.zeros(mx + 2, int)
    for i in range(idx.size):
        cnts[idx[i] + 2] += 1
    for i in range(3, cnts.size):
        cnts[i] += cnts[i-1]
    res = np.empty_like(idx)
    for i in range(idx.size):
        res[cnts[idx[i]+1]] = i
        cnts[idx[i]+1] += 1
    return [res[cnts[i]:cnts[i+1]] for i in range(mx)]
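
Assuming the code above is saved as sort_to_bins.py and compiled as shown, a quick check on the question's example could look like this (the exact dtype of the empty array depends on the platform):

import numpy as np
from sort_to_bins import sort_to_bins  # compiled pythran module (assumed filename)

x = np.array([1, 2, 2, 0, 0, 1, 3, 5])
print(sort_to_bins(x, -1))
# -> [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]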

Here numba wins by a whisker performance-wise:

repeat(lambda:enum_bins_numba_buffer(x),number=10)
# [0.6235917090671137, 0.6071486569708213, 0.6096088469494134]
repeat(lambda:sort_to_bins(x,-1),number=10)
# [0.6235359431011602, 0.6264424560358748, 0.6217901279451326]

Older stuff:

import numpy as np

#pythran export bincollect(int[:])

def bincollect(a):
    o = [[] for _ in range(a.max()+1)]
    for i,j in enumerate(a):
        o[j].append(i)
    return o

Timings vs. numba (old):

timeit(lambda:bincollect(x),number=10)
# 3.5732191529823467
timeit(lambda:enumerate_bins(x),number=10)
# 6.7462647299980745
answered by Paul Panzer

Although the request is for a numpy solution, I decided to see whether there is an interesting numba-based solution. And indeed there is! Here's an approach that represents the partitioned list as a ragged array stored in a single preallocated buffer. This takes some inspiration from the argsort approach proposed by Paul Panzer. (For an older version that didn't do as well, but was simpler, see below.)

import numba
import numpy as np

@numba.jit(numba.void(numba.int64[:],
                      numba.int64[:],
                      numba.int64[:]),
           nopython=True)
def enum_bins_numba_buffer_inner(ints, bins, starts):
    for x in range(len(ints)):
        i = ints[x]
        bins[starts[i]] = x
        starts[i] += 1

@numba.jit(nopython=False)  # Not 100% sure this does anything...
def enum_bins_numba_buffer(ints):
    ends = np.bincount(ints).cumsum()
    starts = np.empty(ends.shape, dtype=np.int64)
    starts[1:] = ends[:-1]
    starts[0] = 0

    bins = np.empty(ints.shape, dtype=np.int64)
    enum_bins_numba_buffer_inner(ints, bins, starts)

    starts[1:] = ends[:-1]
    starts[0] = 0
    return [bins[s:e] for s, e in zip(starts, ends)]
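
A quick sanity check on the question's example array (using an explicit int64 dtype to match the inner signature):

x = np.array([1, 2, 2, 0, 0, 1, 3, 5], dtype=np.int64)
print(enum_bins_numba_buffer(x))
# -> [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]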

This processes a ten-million item list in 75ms, which is nearly a 50x speedup from a list-based version written in pure Python.

For a slower but somewhat more readable version, here's what I had before, based on recently added experimental support for dynamically sized "typed lists," which allow us to fill up each bin in an out-of-order way much more quickly.

This wrestles with numba's type inference engine a bit, and I'm sure there's a better way to handle that part. This also turns out to be almost 10x slower than the above.

@numba.jit(nopython=True)
def enum_bins_numba(ints):
    bins = numba.typed.List()
    for i in range(ints.max() + 1):
        inner = numba.typed.List()
        inner.append(0)  # An awkward way of forcing type inference.
        inner.pop()
        bins.append(inner)

    for x, i in enumerate(ints):
        bins[i].append(x)

    return bins
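
Note that this version returns numba typed lists rather than plain Python lists; for inspection you can convert them, e.g.:

x = np.array([1, 2, 2, 0, 0, 1, 3, 5])
print([list(b) for b in enum_bins_numba(x)])
# -> [[3, 4], [0, 5], [1, 2], [6], [], [7]]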

I tested these against the following:

from collections import defaultdict

import numpy as np
from scipy import sparse

def enum_bins_dict(ints):
    enum_bins = defaultdict(list)
    for k, v in enumerate(ints):
        enum_bins[v].append(k)
    return enum_bins

def enum_bins_list(ints):
    enum_bins = [[] for i in range(ints.max() + 1)]
    for x, i in enumerate(ints):
        enum_bins[i].append(x)
    return enum_bins

def enum_bins_sparse(ints):
    M, N = ints.max() + 1, ints.size
    return sparse.csc_matrix((ints, ints, np.arange(N + 1)),
                             (M, N)).tolil().rows.tolist()

I also tested them against a precompiled cython version similar to enum_bins_numba_buffer (described in detail below).

On a list of ten million random ints (ints = np.random.randint(0, 100, 10000000)) I get the following results:

enum_bins_dict(ints)
3.71 s ± 80.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

enum_bins_list(ints)
3.28 s ± 52.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

enum_bins_sparse(ints)
1.02 s ± 34.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

enum_bins_numba(ints)
693 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

enum_bins_cython(ints)
82.3 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

enum_bins_numba_buffer(ints)
77.4 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Impressively, this way of working with numba outperforms a cython version of the same function, even with bounds-checking turned off. I don't yet have enough familiarity with pythran to test this approach using it, but I'd be interested to see a comparison. It seems likely based on this speedup that the pythran version might also be quite a bit faster with this approach.

Here's the cython version for reference, with some build instructions. Once you have cython installed, you'll need a simple setup.py file like this:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

ext_modules = [
    Extension(
        'enum_bins_cython',
        ['enum_bins_cython.pyx'],
    )
]

setup(
    ext_modules=cythonize(ext_modules),
    include_dirs=[numpy.get_include()]
)

And the cython module, enum_bins_cython.pyx:

# cython: language_level=3

import cython
import numpy
cimport numpy

@cython.boundscheck(False)
@cython.cdivision(True)
@cython.wraparound(False)
cdef void enum_bins_inner(long[:] ints, long[:] bins, long[:] starts) nogil:
    cdef long i, x
    for x in range(len(ints)):
        i = ints[x]
        bins[starts[i]] = x
        starts[i] = starts[i] + 1

def enum_bins_cython(ints):
    assert (ints >= 0).all()
    # There might be a way to avoid storing two offset arrays and
    # save memory, but `enum_bins_inner` modifies the input, and
    # having separate lists of starts and ends is convenient for
    # the final partition stage.
    ends = numpy.bincount(ints).cumsum()
    starts = numpy.empty(ends.shape, dtype=numpy.int64)
    starts[1:] = ends[:-1]
    starts[0] = 0

    bins = numpy.empty(ints.shape, dtype=numpy.int64)
    enum_bins_inner(ints, bins, starts)

    starts[1:] = ends[:-1]
    starts[0] = 0
    return [bins[s:e] for s, e in zip(starts, ends)]

With these two files in your working directory, run this command:

python setup.py build_ext --inplace

You can then import the function using from enum_bins_cython import enum_bins_cython.
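
A minimal usage sketch (note that the memoryview signature expects arrays whose dtype matches C long, which is 64-bit on most Linux builds):

import numpy as np
from enum_bins_cython import enum_bins_cython

x = np.array([1, 2, 2, 0, 0, 1, 3, 5])  # default int64 on Linux matches long[:]
print(enum_bins_cython(x))
# -> [array([3, 4]), array([0, 5]), array([1, 2]), array([6]), array([], dtype=int64), array([7])]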

answered by senderle