I am working with numpy and the following data (all matrices have all cells nonnegative):
>>> X1.shape
(59022, 16)
>>> X3.shape
(59022, 84122)
>>> ind.shape
(59022,)
>>> np.max( ind )
59021
>>> np.min( ind )
0
>>> len( set ( ind.tolist() ) )
59022
In short, ind is simply a permutation of the row indices, i.e. a way to rearrange rows in either matrix. The problem is that while rearranging the rows of the smaller array (X1) works as desired, the same operation on the bigger array (X3) leaves all rows below a certain point zero. Here is what I do:
>>> np.nonzero( np.sum( X3, axis=1 ) )[0].shape
(59022,)
Now let's see what happens if the rows are rearranged:
>>> np.nonzero( np.sum( X3[ ind, : ], axis=1 ) )[0].shape
(7966,)
But for the smaller matrix everything works just fine:
>>> np.nonzero( np.sum( X1, axis=1 ) )[0].shape
(59022,)
>>> np.nonzero( np.sum( X1[ ind, : ], axis=1 ) )[0].shape
(59022,)
One thing I guess I could try is sparse matrices, but I'm wondering whether I can make this work as is. I have 256GB of RAM, so I don't think memory is the constraint. Thanks for your hints!
I strongly suspect your numpy version. It may be a manifestation of this bug, where setting a large array to a value silently fails and outputs zeros. With a range of numpy versions and a bit more time one could probably pin it down for certain.
I have written a test script here which should generate datasets similar to those you describe (code copied below for completeness). I cannot reproduce the original issue.
I can set up a 59022 x 84122 np.array with dtype=np.uint16, but the command of interest gives an out-of-memory error, so I am memory limited and can't test the exact sizes you give. However, if I drop the width down to 54122, the code works as expected (it does not output zeros in rows > 7966).
My numpy version is
numpy.version.version == '1.8.2'
My python version and system is as follows:
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
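For reference, a minimal snippet (my own addition, not from the original report) for printing both version strings, which is useful when comparing a failure against a version-specific bug report:

```python
import sys
import numpy as np

# Print the numpy release and the interpreter build string
print(np.version.version)
print(sys.version)
```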
import numpy as np
import os
# Function to make some test data that will fit in memory...
def makeX(ind, width):
    rowcount = len(ind)
    Xret = np.ones((rowcount, width), dtype=np.uint16)
    col0 = ind.copy()
    col0 = col0.reshape((rowcount, 1))
    np.random.shuffle(col0)
    for r in range(len(Xret)):
        Xret[r] = bytearray(os.urandom(width))
        Xret[r][0] = col0[r]
    return Xret
X3width = 54122 # if this is 84122, the last line fails with MemoryError on my box
                # (16GB memory, ~13 available)
ind = np.array(range(59022))
X1 = makeX(ind, 16)
X3 = makeX(ind, X3width)
print('Shapes of ind, X1 and X3')
print(ind.shape)
print(X1.shape)
print(X3.shape)
print('Contents of ind, X1 and X3')
print(ind)
print(X1)
print(X3)
print('Shape of np.nonzero( np.sum( X3, axis=1 ) )[0]')
print(np.nonzero( np.sum( X3, axis=1 ) )[0].shape)
print('Shape of np.nonzero( np.sum( X3[ ind, : ], axis=1 ) )[0]')
print(np.nonzero( np.sum( X3[ ind, : ], axis=1 ) )[0].shape)
#This outputs (59022,) as expected
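If a single huge fancy-indexing call really is the culprit, one possible workaround (a sketch, untested against the original data sizes) is to apply the permutation a slice at a time, so no individual indexing operation touches the whole array at once:

```python
import numpy as np

def permute_rows_chunked(X, ind, chunk=4096):
    # Apply the row permutation one slice of ind at a time, so no single
    # fancy-indexing call has to materialize the whole permuted array.
    out = np.empty_like(X)
    for start in range(0, len(ind), chunk):
        stop = start + chunk
        out[start:stop] = X[ind[start:stop], :]
    return out

# Small demonstration that the chunked version matches X[ind, :]
X = np.arange(20, dtype=np.uint16).reshape(5, 4)
ind = np.array([3, 1, 4, 0, 2])
print(np.array_equal(permute_rows_chunked(X, ind, chunk=2), X[ind, :]))
```

On a non-buggy numpy this is just a slower equivalent of X[ind, :]; it only helps if the failure is specific to very large single indexing calls.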