Rearranging rows in a big numpy array zeros some rows. How to fix it?

Tags: python, numpy

I am working with numpy and the following data (all cells in all matrices are nonnegative):

>>> X1.shape
(59022, 16)
>>> X3.shape
(59022, 84122)
>>> ind.shape
(59022,)
>>> np.max( ind )
59021
>>> np.min( ind )
0
>>> len( set ( ind.tolist() ) )
59022

In short, ind is simply a way to rearrange rows in either matrix. The problem is that while rearranging the rows in the smaller array (X1) works as desired, the same operation on the bigger array (X3) leads to all rows below a certain point being zero. Here is what I do:

>>> np.nonzero( np.sum( X3, axis=1 ) )[0].shape
(59022,)

Now let's see what happens if the rows are rearranged:

>>> np.nonzero( np.sum( X3[ ind, : ], axis=1 ) )[0].shape
(7966,)

But for the smaller matrix everything works just fine:

>>> np.nonzero( np.sum( X1, axis=1 ) )[0].shape
(59022,)
>>> np.nonzero( np.sum( X1[ ind, : ], axis=1 ) )[0].shape
(59022,)

One thing I am guessing I can try is to use sparse matrices but I'm just wondering if I can make this thing work. I have 256GB of RAM so I don't think memory is a constraint. Thanks for your hints!

marcin_j asked Sep 11 '14 16:09


1 Answer

I strongly suspect your numpy version is the cause. This may be a manifestation of a known numpy bug in which setting a large array to a value silently fails and produces zeros. With the exact numpy versions and a bit more time, it could probably be tracked down definitively.
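One piece of arithmetic seems consistent with a size-related bug, though it is only an observation and not a confirmed diagnosis. The array in the question has more than 2^32 elements. If the element count were truncated to 32 bits somewhere (an assumption, not something verified against the numpy source), the number of rows that would still receive data matches the 7966 nonzero rows reported:

```python
import math

rows, cols = 59022, 84122      # shape of X3 from the question
total = rows * cols            # total number of elements
wrapped = total % 2**32        # element count if truncated to 32 bits (assumed failure mode)

# Rows that would be at least partially filled by `wrapped` elements
rows_touched = math.ceil(wrapped / cols)

print(total > 2**32)   # True
print(rows_touched)    # 7966
```

The smaller array X1 has far fewer than 2^32 elements, which would explain why it is unaffected under the same assumption.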

I have written a test script here which should generate datasets similar to those you describe (code copied below for completeness). I cannot reproduce the original issue.

I can create a 59022 x 84122 np.array with dtype=np.uint16, but the command of interest then fails with an out-of-memory error. Since I am memory limited, I can't test with the exact dimensions you give.

However, if I drop the width down to 54122, the code works as expected (doesn't output zeros in rows > 7966).
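A rough memory estimate is in line with these observations. Fancy indexing such as X3[ ind, : ] returns a copy, so the source array and the rearranged result have to exist at the same time. The sketch below assumes uint16 data as in my test and ignores all other overhead; gib_needed is just an illustrative helper name:

```python
import numpy as np

def gib_needed(rows, cols, dtype=np.uint16):
    """Approximate GiB for the array plus the copy made by X[ind, :]."""
    one_array = rows * cols * np.dtype(dtype).itemsize
    return 2 * one_array / 2**30

print(round(gib_needed(59022, 54122), 1))  # 11.9
print(round(gib_needed(59022, 84122), 1))  # 18.5
```

With roughly 13GB available, the narrower array just fits and the full-width one does not.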

My numpy version is

numpy.version.version == '1.8.2'

My python version and system is as follows:

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32


Scripting Code

import numpy as np
import os

# Function to make some test data that will fit in memory...
def makeX(ind,width):
    rowcount = len(ind)
    Xret = np.ones((rowcount,width),dtype=np.uint16)
    col0 = ind.copy()
    col0 = col0.reshape((rowcount,1))
    np.random.shuffle(col0)

    for r in range(rowcount):
        # Fill each row with random bytes (values 0-255)
        Xret[r] = bytearray(os.urandom(width))
        # Put a shuffled row label in the first column
        Xret[r, 0] = col0[r, 0]

    return Xret

X3width = 54122 # if this is 84122, the last line fails with MemoryError on my box
                # (16GB memory ~13 available)

ind = np.arange(59022)
X1 = makeX(ind,16)
X3 = makeX(ind,X3width)

print('Shapes of ind, X1 and X3')
print(ind.shape)
print(X1.shape)
print(X3.shape)

print('Contents of ind, X1 and X3')
print(ind)
print(X1)
print(X3)

print('Shape of np.nonzero( np.sum( X3, axis=1 ) )[0]')
print(np.nonzero( np.sum( X3, axis=1 ) )[0].shape)
print('Shape of np.nonzero( np.sum( X3[ ind, : ], axis=1 ) )[0]')
print(np.nonzero( np.sum( X3[ ind, : ], axis=1 ) )[0].shape)

#This outputs (59022,) as expected
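
If upgrading numpy is not an option, one possible workaround is to rearrange the rows in blocks, so that no single indexing operation touches more than a modest number of elements. This is only a sketch and has not been tested against the affected numpy version; the helper name take_rows_chunked and the chunk size are made up for illustration:

```python
import numpy as np

def take_rows_chunked(X, ind, chunk=1024):
    """Return X[ind, :] built block by block (hypothetical helper)."""
    out = np.empty((len(ind), X.shape[1]), dtype=X.dtype)
    for start in range(0, len(ind), chunk):
        stop = start + chunk
        # Each fancy-indexing call copies at most `chunk` rows
        out[start:stop] = X[ind[start:stop]]
    return out

# Small self-check against plain fancy indexing
rng = np.random.RandomState(0)
X = rng.randint(1, 100, size=(50, 7)).astype(np.uint16)
ind = rng.permutation(50)
assert np.array_equal(take_rows_chunked(X, ind, chunk=8), X[ind, :])
```

Note that the output array still needs as much memory as the input, so this does not help with the MemoryError above; it only avoids one very large indexing call.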
J Richard Snape answered Nov 15 '22 08:11