NumPy - np.searchsorted for 2-D arrays

np.searchsorted works only for 1D arrays.

I have a lexicographically sorted 2D array: the 0th row is sorted; for equal values in the 0th row, the corresponding elements of the 1st row are sorted too; for equal values in the 1st row, the values of the 2nd row are sorted, and so on. In other words, the tuples formed by the columns are sorted.

I have another 2D array whose column-tuples need to be inserted into the first 2D array at the correct positions. In the 1D case, np.searchsorted is the usual way to find the correct positions.

But is there an alternative to np.searchsorted for 2D arrays? Something analogous to how np.lexsort is the 2D alternative to 1D np.argsort.

If there is no such function, can this functionality be implemented efficiently using existing numpy functions?

I am interested in efficient solutions for arrays of any dtype including np.object_.

One naive way to handle the arbitrary-dtype case would be to convert each column of both arrays to a 1D array (or tuple) and then store these columns in another 1D array of dtype = np.object_. Maybe it is not that naive and could even be fast, especially if the columns are quite tall.
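A rough sketch of that column-tuple idea, with toy data assumed (not benchmarked):

```python
import numpy as np

# The sorted array: each column is one "tuple".
a = np.array([['a', 'a', 'b'],
              ['1', 'z', 'a']], dtype=object)

# Pack each column into a tuple held in a 1D object array.
cols = np.empty(a.shape[1], dtype=object)
cols[:] = [tuple(c) for c in a.T]

# Wrap the search key the same way so searchsorted compares tuple to tuple.
keys = np.empty(1, dtype=object)
keys[:] = [('a', 'a')]

pos = np.searchsorted(cols, keys)  # ('a', '1') < ('a', 'a') < ('a', 'z')
print(pos)
```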

asked Feb 07 '26 by Arty
1 Answer

Two things can help you here: (1) you can sort and search structured arrays, and (2) if you have finite collections that can be mapped to integers, you can use that to your advantage.

Viewing as 1D

Let's say you have an array of strings that you want to insert into:

data = np.array([['a', '1'], ['a', 'z'], ['b', 'a']], dtype=object)

Since arrays are never ragged, you can construct a dtype that is the size of a row:

dt = np.dtype([('', data.dtype)] * data.shape[1])

Using my shamelessly plugged answer here, you can view the original 2D array as 1D now:

view = np.ndarray(data.shape[:1], dtype=dt, buffer=data)

Searching can be done in a totally straightforward manner now:

key = np.array([('a', 'a')], dtype=dt)
index = np.searchsorted(view, key)

You can even find the insertion indices of incomplete elements by using the appropriate minimum values. For strings this would be ''.
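Putting those steps together on a toy array (a plain '<U1' string dtype is used here so the row view can be taken with an ordinary .view(); the np.ndarray(..., buffer=...) construction above is what handles the object-dtype case):

```python
import numpy as np

data = np.array([['a', '1'], ['a', 'z'], ['b', 'a']])  # dtype '<U1', rows sorted

# One anonymous field per column; rows then compare field by field.
dt = np.dtype([('', data.dtype)] * data.shape[1])
view = data.view(dt).ravel()

key = np.array([('a', 'a')], dtype=dt)
index = np.searchsorted(view, key)

# A partial key padded with the minimum string '' lands before all 'a' rows.
first = np.searchsorted(view, np.array([('a', '')], dtype=dt))
print(index, first)
```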

Faster Comparison

You may get better mileage out of the comparison if you don't have to check each field of the dtype. You can make a similar dtype with a single homogeneous field:

dt2 = np.dtype([('row', data.dtype, data.shape[1])])

Constructing the view is the same as before:

view = np.ndarray(data.shape[:1], dtype=dt2, buffer=data)

The key is constructed a little differently this time (another plug here):

key = np.array([(['a', 'a'],)], dtype=dt2)

The sort order this method imposes on objects is not correct: see Sorting array of objects by row using custom dtype. I am leaving the reference here in case a fix appears in the linked question. The method is still quite useful for sorting integers, though.
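For integers the combined field behaves as expected; a minimal sketch with assumed toy values:

```python
import numpy as np

idata = np.array([[0, 0], [0, 2], [1, 1]])  # lexicographically sorted integer rows

# One subarray field spanning the whole row.
dt2 = np.dtype([('row', idata.dtype, idata.shape[1])])
view = idata.view(dt2).ravel()

key = np.array([([0, 1],)], dtype=dt2)
pos = np.searchsorted(view, key)  # (0, 0) < (0, 1) < (0, 2)
print(pos)
```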

Integer Mapping

If you have a finite number of objects to search, it is easier to map them to integers:

idata = np.empty(data.shape, dtype=int)
keys = [None] * data.shape[1]     # Map index to key per column
indices = [None] * data.shape[1]  # Map key to index per column
for i in range(data.shape[1]):
    keys[i], idata[:, i] = np.unique(data[:, i], return_inverse=True)
    indices[i] = {k: j for j, k in enumerate(keys[i])}  # Assumes hashable objects

idt = np.dtype([('row', idata.dtype, idata.shape[1])])
view = idata.view(idt).ravel()

This only works if data actually contains all the possible keys in each column. Otherwise, you will have to get the forward and reverse mappings by other means. Once you have that established, setting up the keys is much simpler, and only requires indices:

key = np.array([index[k] for index, k in zip(indices, ['a', 'a'])])
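End to end, the mapping approach on a toy object array (assuming, as stated, that every key value occurs in data):

```python
import numpy as np

data = np.array([['a', '1'], ['a', 'z'], ['b', 'a']], dtype=object)

# Map each column's objects to integers; np.unique preserves sort order.
idata = np.empty(data.shape, dtype=int)
indices = [None] * data.shape[1]
for i in range(data.shape[1]):
    uniq, idata[:, i] = np.unique(data[:, i], return_inverse=True)
    indices[i] = {k: j for j, k in enumerate(uniq)}

idt = np.dtype([('row', idata.dtype, idata.shape[1])])
view = idata.view(idt).ravel()

key = np.array([([index[k] for index, k in zip(indices, ['a', 'a'])],)], dtype=idt)
pos = np.searchsorted(view, key)
print(pos)
```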

Further Improvements

If you have eight or fewer columns, and each column has 256 or fewer distinct values, you can construct an even better hash by packing each row into a single np.uint64 element or smaller.

k = data.shape[1]            # one byte per column
assert k in (1, 2, 4, 8)     # the only valid widths for a single unsigned integer
idata = np.empty((data.shape[0], k), dtype=np.uint8)
...
idata = idata.view(f'>u{k}').ravel()

Keys are made similarly as well:

key = np.array([index[c] for index, c in zip(indices, ['a', 'a'])], dtype=np.uint8).view(f'>u{k}')
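For instance, two uint8 columns pack into a single big-endian np.uint16 per row (toy values assumed):

```python
import numpy as np

idata = np.array([[0, 0], [0, 2], [1, 1]], dtype=np.uint8)  # integer-mapped, sorted

# Each row becomes one big-endian uint16; big-endian keeps the first column
# in the most significant byte, preserving lexicographic order.
packed = idata.view('>u2').ravel()
print(packed)  # [0, 2, 257]

key = np.array([0, 1], dtype=np.uint8).view('>u2')
pos = np.searchsorted(packed, key)
print(pos)  # [1]
```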

Timing

I've timed the methods shown here (not the other answers) using randomly shuffled strings. Key timing parameters are:

  • M: number of rows: 10**{2, 3, 4, 5}
  • N: number of columns: 2**{3, 4, 5, 6}
  • K: number of elements to insert: 1, 10, M // 10
  • method: individual_fields, combined_fields, int_mapping, int_packing. Functions shown below.

For the last two methods, I assume that you will pre-convert the data into the mapped dtype, but not the search keys. I am therefore passing in the converted data, but timing the conversion of the keys.

import numpy as np
from math import ceil, log

def individual_fields(data, keys):
    dt = [('', data.dtype)] * data.shape[1]
    dview = np.ndarray(data.shape[:1], dtype=dt, buffer=data)
    kview = np.ndarray(keys.shape[:1], dtype=dt, buffer=keys)
    return np.searchsorted(dview, kview)

def combined_fields(data, keys):
    dt = [('row', data.dtype, data.shape[1])]
    dview = np.ndarray(data.shape[:1], dtype=dt, buffer=data)
    kview = np.ndarray(keys.shape[:1], dtype=dt, buffer=keys)
    return np.searchsorted(dview, kview)

def int_mapping(idata, keys, indices):
    idt = np.dtype([('row', idata.dtype, idata.shape[1])])
    dview = idata.view(idt).ravel()
    kview = np.empty(keys.shape[0], dtype=idt)
    for i, (index, key) in enumerate(zip(indices, keys.T)):
        kview['row'][:, i] = [index[k] for k in key]
    return np.searchsorted(dview, kview)

def int_packing(idata, keys, indices):
    idt = f'>u{idata.shape[1]}'
    dview = idata.view(idt).ravel()
    kview = np.empty(keys.shape, dtype=np.uint8)
    for i, (index, key) in enumerate(zip(indices, keys.T)):
        kview[:, i] = [index[k] for k in key]
    kview = kview.view(idt).ravel()
    return np.searchsorted(dview, kview)

The timing code:

from math import ceil, log
from string import ascii_lowercase
from timeit import Timer

def time(m, n, k, fn, *args):
    t = Timer(lambda: fn(*args))
    s = t.autorange()[0]
    print(f'M={m}; N={n}; K={k} {fn.__name__}: {min(t.repeat(5, s)) / s}')

selection = np.array(list(ascii_lowercase), dtype=object)
for lM in range(2, 6):
    M = 10**lM
    for lN in range(3, 6):
        N = 2**lN
        data = np.random.choice(selection, size=(M, N))
        np.ndarray(data.shape[0], dtype=[('', data.dtype)] * data.shape[1], buffer=data).sort()
        idata = np.array([[ord(a) - ord('a') for a in row] for row in data], dtype=np.uint8)
        ikeys = [selection] * data.shape[1]
        indices = [{k: i for i, k in enumerate(selection)}] * data.shape[1]
        for K in (1, 10, M // 10):
            key = np.random.choice(selection, size=(K, N))
            time(M, N, K, individual_fields, data, key)
            time(M, N, K, combined_fields, data, key)
            time(M, N, K, int_mapping, idata, key, indices)
            if N <= 8:
                time(M, N, K, int_packing, idata, key, indices)

The results:

M=100 (units=us)

   |                           K                           |
   +---------------------------+---------------------------+
N  |             1             |            10             |
   +------+------+------+------+------+------+------+------+
   |  IF  |  CF  |  IM  |  IP  |  IF  |  CF  |  IM  |  IP  |
---+------+------+------+------+------+------+------+------+
 8 | 25.9 | 18.6 | 52.6 | 48.2 | 35.8 | 22.7 | 76.3 | 68.2 | 
16 | 40.1 | 19.0 | 87.6 |  --  | 51.1 | 22.8 | 130. |  --  |
32 | 68.3 | 18.7 | 157. |  --  | 79.1 | 22.4 | 236. |  --  |
64 | 125. | 18.7 | 290. |  --  | 135. | 22.4 | 447. |  --  |
---+------+------+------+------+------+------+------+------+

M=1000 (units=us)

   |                                         K                                         |
   +---------------------------+---------------------------+---------------------------+
N  |             1             |            10             |            100            |
   +------+------+------+------+------+------+------+------+------+------+------+------+
   |  IF  |  CF  |  IM  |  IP  |  IF  |  CF  |  IM  |  IP  |  IF  |  CF  |  IM  |  IP  |
---+------+------+------+------+------+------+------+------+------+------+------+------+
 8 | 26.9 | 19.1 | 55.0 | 55.0 | 44.8 | 25.1 | 79.2 | 75.0 | 218. | 74.4 | 305. | 250. |
16 | 41.0 | 19.2 | 90.5 |  --  | 59.3 | 24.6 | 134. |  --  | 244. | 79.0 | 524. |  --  | 
32 | 68.5 | 19.0 | 159. |  --  | 87.4 | 24.7 | 241. |  --  | 271. | 80.5 | 984. |  --  |
64 | 128. | 19.7 | 312. |  --  | 168. | 26.0 | 549. |  --  | 396. | 7.78 | 2.0k |  --  |
---+------+------+------+------+------+------+------+------+------+------+------+------+

M=10K (units=us)

   |                                         K                                         |
   +---------------------------+---------------------------+---------------------------+
N  |             1             |            10             |           1000            |
   +------+------+------+------+------+------+------+------+------+------+------+------+
   |  IF  |  CF  |  IM  |  IP  |  IF  |  CF  |  IM  |  IP  |  IF  |  CF  |  IM  |  IP  |
---+------+------+------+------+------+------+------+------+------+------+------+------+
 8 | 28.8 | 19.5 | 54.5 | 107. | 57.0 | 27.2 | 90.5 | 128. | 3.2k | 762. | 2.7k | 2.1k |
16 | 42.5 | 19.6 | 90.4 |  --  | 73.0 | 27.2 | 140. |  --  | 3.3k | 752. | 4.6k |  --  |
32 | 73.0 | 19.7 | 164. |  --  | 104. | 26.7 | 246. |  --  | 3.4k | 803. | 8.6k |  --  |
 64 | 135. | 19.8 | 302. |  --  | 162. | 26.1 | 466. |  --  | 3.7k | 791. |  17k |  --  |
---+------+------+------+------+------+------+------+------+------+------+------+------+

individual_fields (IF) is generally the fastest working method. Its cost grows in proportion to the number of columns. Unfortunately, combined_fields (CF) does not work for object arrays; otherwise it would be not only the fastest method, but also the one whose cost does not grow with the number of columns.

All the techniques I thought would be faster are not, because mapping Python objects to integer keys is slow (the actual lookup on packed int arrays, for example, is much faster than on structured arrays).

References

Here are the additional questions I had to ask to get this code working at all:

  • View object array under different dtype
  • Creating array with single structured element containing an array
  • Sorting array of objects by row using custom dtype

answered Feb 09 '26 by Mad Physicist


