I have a set of large arrays (about 6 million elements each) that I want to basically perform a np.digitize but over multiple axes. I am looking for some suggestions on both how to effectively do this but also on how to store the results. I need all the indices (or all the values, or a mask) of array A where the values of array B are in a range and the values of array C are in another range and D in yet another. I want either the values, indices, or mask so that I can do some as of yet undecided statistics on the values of the A array in each bin. I will also need the number of elements in each bin but <code>len()</code> can do that. Here is one example I worked up that seems reasonable: <pre class="prettyprint"><code>import itertools import numpy as np A = np.random.random_sample(1e4) B = (np.random.random_sample(1e4) + 10)*20 C = (np.random.random_sample(1e4) + 20)*40 D = (np.random.random_sample(1e4) + 80)*80 # make the edges of the bins Bbins = np.linspace(B.min(), B.max(), 10) Cbins = np.linspace(C.min(), C.max(), 12) # note different number Dbins = np.linspace(D.min(), D.max(), 24) # note different number B_Bidx = np.digitize(B, Bbins) C_Cidx = np.digitize(C, Cbins) D_Didx = np.digitize(D, Dbins) a_bins = [] for bb, cc, dd in itertools.product(np.unique(B_Bidx), np.unique(C_Cidx), np.unique(D_Didx)): a_bins.append([(bb, cc, dd), [A[np.bitwise_and((B_Bidx==bb), (C_Cidx==cc), (D_Didx==dd))]]]) </code></pre> This however makes me nervous that I will run out of memory on large arrays. I could also do it this way: <pre class="prettyprint"><code>b_inds = np.empty((len(A), 10), dtype=np.bool) c_inds = np.empty((len(A), 12), dtype=np.bool) d_inds = np.empty((len(A), 24), dtype=np.bool) for i in range(10): b_inds[:,i] = B_Bidx = i for i in range(12): c_inds[:,i] = C_Cidx = i for i in range(24): d_inds[:,i] = D_Didx = i # get the A data for the 1,2,3 B,C,D bin print A[b_inds[:,1] & c_inds[:,2] & d_inds[:,3]] </code></pre> At least here the output is of known and constant size. Does anyone have any better thoughts on how to do this smarter? Or clarification that is needed? <hr> Based on the answer by HYRY this is the path I decided to take. <pre class="prettyprint"><code>import numpy as np import pandas as pd np.random.seed(42) A = np.random.random_sample(1e7) B = (np.random.random_sample(1e7) + 10)*20 C = (np.random.random_sample(1e7) + 20)*40 D = (np.random.random_sample(1e7) + 80)*80 # make the edges of the bins we want Bbins = np.linspace(B.min(), B.max(), 9) Cbins = np.linspace(C.min(), C.max(), 10) # note different number Dbins = np.linspace(D.min(), D.max(), 11) # note different number sA = pd.Series(A) cB = pd.cut(B, Bbins, include_lowest=True) cC = pd.cut(C, Cbins, include_lowest=True) cD = pd.cut(D, Dbins, include_lowest=True) dat = pd.DataFrame({'A':A, 'cB':cB.labels, 'cC':cC.labels, 'cD':cD.labels}) g = sA.groupby([cB.labels, cC.labels, cD.labels]).indices # this then gives all the indices that match the group print g[0,1,2] # this is all the array A data for that B,C,D bin print sA[g[0,1,2]] </code></pre> This method seems lightning fast even for huge arrays.

How about use <code>groupby</code> in Pandas. Fix some problem in your code first: <pre class="prettyprint"><code>import itertools import numpy as np np.random.seed(42) A = np.random.random_sample(1e4) B = (np.random.random_sample(1e4) + 10)*20 C = (np.random.random_sample(1e4) + 20)*40 D = (np.random.random_sample(1e4) + 80)*80 # make the edges of the bins Bbins = np.linspace(B.min(), B.max(), 10) Cbins = np.linspace(C.min(), C.max(), 12) # note different number Dbins = np.linspace(D.min(), D.max(), 24) # note different number B_Bidx = np.digitize(B, Bbins) C_Cidx = np.digitize(C, Cbins) D_Didx = np.digitize(D, Dbins) a_bins = [] for bb, cc, dd in itertools.product(np.unique(B_Bidx), np.unique(C_Cidx), np.unique(D_Didx)): a_bins.append([(bb, cc, dd), A[(B_Bidx==bb) & (C_Cidx==cc) & (D_Didx==dd)]]) a_bins[1000] </code></pre> output: <pre class="prettyprint"><code>[(4, 6, 17), array([ 0.70723863, 0.907611 , 0.46214047])] </code></pre> Here is the code that return the same result by Pandas: <pre class="prettyprint"><code>import pandas as pd cB = pd.cut(B, 9) cC = pd.cut(C, 11) cD = pd.cut(D, 23) sA = pd.Series(A) g = sA.groupby([cB.labels, cC.labels, cD.labels]) g.get_group((3, 5, 16)) </code></pre> output: <pre class="prettyprint"><code>800 0.707239 2320 0.907611 9388 0.462140 dtype: float64 </code></pre> If you want to calculate some statistics of every group, you can call the method of <code>g</code>, for example: <pre class="prettyprint"><code>g.mean() </code></pre> returns: <pre class="prettyprint"><code>0 0 0 0.343566 1 0.410979 2 0.700007 3 0.189936 4 0.452566 5 0.565330 6 0.539565 7 0.530867 8 0.568120 9 0.587762 11 0.352453 12 0.484903 13 0.477969 14 0.484328 15 0.467357 ... 8 10 8 0.559859 9 0.570652 10 0.656718 11 0.353938 12 0.628980 13 0.372350 14 0.404543 15 0.387920 16 0.742292 17 0.530866 18 0.389236 19 0.628461 20 0.387384 21 0.541831 22 0.573023 Length: 2250, dtype: float64 </code></pre>

Extending numpy.digitize to multi-dimensional data

Tags:

python

numpy

I have a set of large arrays (about 6 million elements each) that I want to basically perform a np.digitize but over multiple axes. I am looking for some suggestions on both how to effectively do this but also on how to store the results.

I need all the indices (or all the values, or a mask) of array A where the values of array B are in a range and the values of array C are in another range and D in yet another. I want either the values, indices, or mask so that I can do some as of yet undecided statistics on the values of the A array in each bin. I will also need the number of elements in each bin but len() can do that.

Here is one example I worked up that seems reasonable:

import itertools
import numpy as np

A = np.random.random_sample(1e4)
B = (np.random.random_sample(1e4) + 10)*20
C = (np.random.random_sample(1e4) + 20)*40
D = (np.random.random_sample(1e4) + 80)*80

# make the edges of the bins
Bbins = np.linspace(B.min(), B.max(), 10)
Cbins = np.linspace(C.min(), C.max(), 12) # note different number
Dbins = np.linspace(D.min(), D.max(), 24) # note different number

B_Bidx = np.digitize(B, Bbins)
C_Cidx = np.digitize(C, Cbins)
D_Didx = np.digitize(D, Dbins)

a_bins = []
for bb, cc, dd in itertools.product(np.unique(B_Bidx), 
                                    np.unique(C_Cidx), 
                                    np.unique(D_Didx)):
    a_bins.append([(bb, cc, dd), [A[np.bitwise_and((B_Bidx==bb),
                                                   (C_Cidx==cc),
                                                   (D_Didx==dd))]]])

This however makes me nervous that I will run out of memory on large arrays.

I could also do it this way:

b_inds = np.empty((len(A), 10), dtype=np.bool)
c_inds = np.empty((len(A), 12), dtype=np.bool)
d_inds = np.empty((len(A), 24), dtype=np.bool)
for i in range(10):
    b_inds[:,i] = B_Bidx = i     
for i in range(12):
    c_inds[:,i] = C_Cidx = i     
for i in range(24):
    d_inds[:,i] = D_Didx = i     
# get the A data for the 1,2,3 B,C,D bin
print A[b_inds[:,1] & c_inds[:,2] & d_inds[:,3]]

At least here the output is of known and constant size.

Does anyone have any better thoughts on how to do this smarter? Or clarification that is needed?

Based on the answer by HYRY this is the path I decided to take.

import numpy as np
import pandas as pd

np.random.seed(42)
A =  np.random.random_sample(1e7)
B = (np.random.random_sample(1e7) + 10)*20
C = (np.random.random_sample(1e7) + 20)*40
D = (np.random.random_sample(1e7) + 80)*80
# make the edges of the bins we want
Bbins = np.linspace(B.min(), B.max(), 9)
Cbins = np.linspace(C.min(), C.max(), 10) # note different number
Dbins = np.linspace(D.min(), D.max(), 11) # note different number
sA = pd.Series(A)
cB = pd.cut(B, Bbins, include_lowest=True)
cC = pd.cut(C, Cbins, include_lowest=True)
cD = pd.cut(D, Dbins, include_lowest=True)

dat = pd.DataFrame({'A':A, 'cB':cB.labels, 'cC':cC.labels, 'cD':cD.labels})
g = sA.groupby([cB.labels, cC.labels, cD.labels]).indices
# this then gives all the indices that match the group 
print g[0,1,2]
# this is all the array A data for that B,C,D bin
print sA[g[0,1,2]]

This method seems lightning fast even for huge arrays.

850

asked Jul 08 '14 23:07

Brian Larsen

1 Answers

How about use groupby in Pandas. Fix some problem in your code first:

import itertools
import numpy as np

np.random.seed(42)

A = np.random.random_sample(1e4)
B = (np.random.random_sample(1e4) + 10)*20
C = (np.random.random_sample(1e4) + 20)*40
D = (np.random.random_sample(1e4) + 80)*80

# make the edges of the bins
Bbins = np.linspace(B.min(), B.max(), 10)
Cbins = np.linspace(C.min(), C.max(), 12) # note different number
Dbins = np.linspace(D.min(), D.max(), 24) # note different number

B_Bidx = np.digitize(B, Bbins)
C_Cidx = np.digitize(C, Cbins)
D_Didx = np.digitize(D, Dbins)

a_bins = []
for bb, cc, dd in itertools.product(np.unique(B_Bidx), 
                                    np.unique(C_Cidx), 
                                    np.unique(D_Didx)):
    a_bins.append([(bb, cc, dd), A[(B_Bidx==bb) & (C_Cidx==cc) & (D_Didx==dd)]])

a_bins[1000]

output:

[(4, 6, 17), array([ 0.70723863,  0.907611  ,  0.46214047])]

Here is the code that return the same result by Pandas:

import pandas as pd

cB = pd.cut(B, 9)
cC = pd.cut(C, 11)
cD = pd.cut(D, 23)

sA = pd.Series(A)
g = sA.groupby([cB.labels, cC.labels, cD.labels])
g.get_group((3, 5, 16))

output:

800     0.707239
2320    0.907611
9388    0.462140
dtype: float64

If you want to calculate some statistics of every group, you can call the method of g, for example:

g.mean()

returns:

0  0  0     0.343566
      1     0.410979
      2     0.700007
      3     0.189936
      4     0.452566
      5     0.565330
      6     0.539565
      7     0.530867
      8     0.568120
      9     0.587762
      11    0.352453
      12    0.484903
      13    0.477969
      14    0.484328
      15    0.467357
...
8  10  8     0.559859
       9     0.570652
       10    0.656718
       11    0.353938
       12    0.628980
       13    0.372350
       14    0.404543
       15    0.387920
       16    0.742292
       17    0.530866
       18    0.389236
       19    0.628461
       20    0.387384
       21    0.541831
       22    0.573023
Length: 2250, dtype: float64

144

answered Oct 16 '22 19:10

HYRY

Related questions
                            
                                Time complexity of random.sample
                            
                                how to avoid python numeric literals beginning with "0" being treated as octal?
                            
                                Cython and fortran - how to compile together without f2py
                            
                                Which special methods bypasses __getattribute__ in Python?
                            
                                Please explain reasoning behind code snippet from argparse module
                            
                                Trying to verify SHA1 message signature using Python. What am I doing wrong?
                            
                                How does a function in Python remember its values after it returns? [duplicate]
                            
                                Iterating over a Django QuerySet while deleting objects in the same QuerySet
                            
                                Sorting algorithm to keep equal values separated
                            
                                How do I unit-test HTTPS requests in Flask?
                            
                                Numba slow when assigning to an array?
                            
                                How can I transform this (100, 100) numpy array into a grayscale sprite in pygame?
                            
                                PickleType with Mutable Tracking in SqlAlchemy
                            
                                Django Elastic Beanstalk Deploy showing 404
                            
                                How to search content across multiple channel in youtube api?
                            
                                LiveServerTestCase hangs at python-requests post call in django view
                            
                                How do you visualize a ward tree from sklearn.cluster.ward_tree?
                            
                                Exit code 139 when performing image subtraction
                            
                                Python - detect when my object is written to stdout?
                            
                                Django REST Framework - OAuth2 Consumer API from external provider

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With