Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extending numpy.digitize to multi-dimensional data

Tags:

python

numpy

I have a set of large arrays (about 6 million elements each) that I want to basically perform a np.digitize but over multiple axes. I am looking for some suggestions on both how to effectively do this but also on how to store the results.

I need all the indices (or all the values, or a mask) of array A where the values of array B are in a range and the values of array C are in another range and D in yet another. I want either the values, indices, or mask so that I can do some as of yet undecided statistics on the values of the A array in each bin. I will also need the number of elements in each bin but len() can do that.

Here is one example I worked up that seems reasonable:

import itertools
import numpy as np

A = np.random.random_sample(1e4)
B = (np.random.random_sample(1e4) + 10)*20
C = (np.random.random_sample(1e4) + 20)*40
D = (np.random.random_sample(1e4) + 80)*80

# make the edges of the bins
Bbins = np.linspace(B.min(), B.max(), 10)
Cbins = np.linspace(C.min(), C.max(), 12) # note different number
Dbins = np.linspace(D.min(), D.max(), 24) # note different number

B_Bidx = np.digitize(B, Bbins)
C_Cidx = np.digitize(C, Cbins)
D_Didx = np.digitize(D, Dbins)

a_bins = []
for bb, cc, dd in itertools.product(np.unique(B_Bidx), 
                                    np.unique(C_Cidx), 
                                    np.unique(D_Didx)):
    a_bins.append([(bb, cc, dd), [A[np.bitwise_and((B_Bidx==bb),
                                                   (C_Cidx==cc),
                                                   (D_Didx==dd))]]])

This however makes me nervous that I will run out of memory on large arrays.

I could also do it this way:

b_inds = np.empty((len(A), 10), dtype=np.bool)
c_inds = np.empty((len(A), 12), dtype=np.bool)
d_inds = np.empty((len(A), 24), dtype=np.bool)
for i in range(10):
    b_inds[:,i] = B_Bidx = i     
for i in range(12):
    c_inds[:,i] = C_Cidx = i     
for i in range(24):
    d_inds[:,i] = D_Didx = i     
# get the A data for the 1,2,3 B,C,D bin
print A[b_inds[:,1] & c_inds[:,2] & d_inds[:,3]]

At least here the output is of known and constant size.

Does anyone have any better thoughts on how to do this smarter? Or clarification that is needed?


Based on the answer by HYRY this is the path I decided to take.

import numpy as np
import pandas as pd

np.random.seed(42)
A =  np.random.random_sample(1e7)
B = (np.random.random_sample(1e7) + 10)*20
C = (np.random.random_sample(1e7) + 20)*40
D = (np.random.random_sample(1e7) + 80)*80
# make the edges of the bins we want
Bbins = np.linspace(B.min(), B.max(), 9)
Cbins = np.linspace(C.min(), C.max(), 10) # note different number
Dbins = np.linspace(D.min(), D.max(), 11) # note different number
sA = pd.Series(A)
cB = pd.cut(B, Bbins, include_lowest=True)
cC = pd.cut(C, Cbins, include_lowest=True)
cD = pd.cut(D, Dbins, include_lowest=True)

dat = pd.DataFrame({'A':A, 'cB':cB.labels, 'cC':cC.labels, 'cD':cD.labels})
g = sA.groupby([cB.labels, cC.labels, cD.labels]).indices
# this then gives all the indices that match the group 
print g[0,1,2]
# this is all the array A data for that B,C,D bin
print sA[g[0,1,2]]

This method seems lightning fast even for huge arrays.

like image 850
Brian Larsen Avatar asked Jul 08 '14 23:07

Brian Larsen


People also ask

What is the use of NumPy digitize?

What is Numpy digitize () Numpy digitize () function helps to get the indices of the bin to which each value of the input array belongs and returns an array containing the indices of the bin. Input array having the values and output array holding the indices of bins can be multidimensional.

What is NumPy in Python?

NumPy (Numerical Python) is a Python library that comprises of multidimensional arrays and numerous functions to perform various mathematical and logical operations on them. NumPy also consists of various functions to perform linear algebra operations and generate random numbers.

How to find the indices of the bins corresponding to NumPy?

Numpy is very popular python packages that allow you to manipulate NumPy array. There are many NumPy function that does so. Numpy digitize is one of them. Using this function you can find the indices of the bins corresponding to the input array.

How do I access the contents of an array in NumPy?

An n-dimensional (multidimensional) array has a fixed size and contains items of the same type. the contents of the multidimensional array can be accessed and modified by using indexing and slicing the array as desired. For accessing elements of an array we need to first import the library: import numpy as np


1 Answers

How about use groupby in Pandas. Fix some problem in your code first:

import itertools
import numpy as np

np.random.seed(42)

A = np.random.random_sample(1e4)
B = (np.random.random_sample(1e4) + 10)*20
C = (np.random.random_sample(1e4) + 20)*40
D = (np.random.random_sample(1e4) + 80)*80

# make the edges of the bins
Bbins = np.linspace(B.min(), B.max(), 10)
Cbins = np.linspace(C.min(), C.max(), 12) # note different number
Dbins = np.linspace(D.min(), D.max(), 24) # note different number

B_Bidx = np.digitize(B, Bbins)
C_Cidx = np.digitize(C, Cbins)
D_Didx = np.digitize(D, Dbins)

a_bins = []
for bb, cc, dd in itertools.product(np.unique(B_Bidx), 
                                    np.unique(C_Cidx), 
                                    np.unique(D_Didx)):
    a_bins.append([(bb, cc, dd), A[(B_Bidx==bb) & (C_Cidx==cc) & (D_Didx==dd)]])

a_bins[1000]

output:

[(4, 6, 17), array([ 0.70723863,  0.907611  ,  0.46214047])]

Here is the code that return the same result by Pandas:

import pandas as pd

cB = pd.cut(B, 9)
cC = pd.cut(C, 11)
cD = pd.cut(D, 23)

sA = pd.Series(A)
g = sA.groupby([cB.labels, cC.labels, cD.labels])
g.get_group((3, 5, 16))

output:

800     0.707239
2320    0.907611
9388    0.462140
dtype: float64

If you want to calculate some statistics of every group, you can call the method of g, for example:

g.mean()

returns:

0  0  0     0.343566
      1     0.410979
      2     0.700007
      3     0.189936
      4     0.452566
      5     0.565330
      6     0.539565
      7     0.530867
      8     0.568120
      9     0.587762
      11    0.352453
      12    0.484903
      13    0.477969
      14    0.484328
      15    0.467357
...
8  10  8     0.559859
       9     0.570652
       10    0.656718
       11    0.353938
       12    0.628980
       13    0.372350
       14    0.404543
       15    0.387920
       16    0.742292
       17    0.530866
       18    0.389236
       19    0.628461
       20    0.387384
       21    0.541831
       22    0.573023
Length: 2250, dtype: float64
like image 144
HYRY Avatar answered Oct 16 '22 19:10

HYRY