Rearranging rows in a big numpy array zeros some rows. How to fix it?

Question

I am working with numpy and the following data (all cells in all matrices are non-negative):

>>> X1.shape
(59022, 16)
>>> X3.shape
(59022, 84122)
>>> ind.shape
(59022,)
>>> np.max( ind )
59021
>>> np.min( ind )
0
>>> len( set ( ind.tolist() ) )
59022

In short, ind is simply a permutation used to rearrange the rows of either matrix. The problem is that while rearranging the rows of the smaller array (X1) works as desired, the same operation on the bigger array (X3) leaves every row below a certain point zero. Here is what I do:

>>> np.nonzero( np.sum( X3, axis=1 ) )[0].shape
(59022,)

Now let's see what happens if the rows are rearranged:

>>> np.nonzero( np.sum( X3[ ind, : ], axis=1 ) )[0].shape
(7966,)

But for the smaller matrix everything works just fine:

>>> np.nonzero( np.sum( X1, axis=1 ) )[0].shape
(59022,)
>>> np.nonzero( np.sum( X1[ ind, : ], axis=1 ) )[0].shape
(59022,)

One thing I could try is sparse matrices, but I would like to know whether this can be made to work with dense arrays. I have 256GB of RAM, so I don't think memory is a constraint. Thanks for your hints!

J Richard Snape · Accepted Answer

I strongly suspect your numpy version. This may be a manifestation of this bug, where setting a large array to a value silently fails and outputs zeros. With your exact numpy version and a bit more time, it could probably be tracked down definitively.
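As a rough consistency check on the overflow idea, here is some arithmetic on the shapes from the question. It assumes, purely as a hypothesis and not something confirmed against the numpy source, that the number of elements copied by the fancy-indexing step was wrapped to an unsigned 32-bit count. Under that assumption the number of rows that would receive data matches the 7966 reported in the question.

```python
# Shapes taken from the question.
rows, width = 59022, 84122

total_elements = rows * width          # elements in X3
wrapped = total_elements % 2**32       # ASSUMPTION: count wrapped to 32 bits

# Rows that would be filled if only `wrapped` elements were copied:
# some number of full rows plus possibly one partially filled row.
full_rows, partial = divmod(wrapped, width)
rows_with_data = full_rows + (1 if partial else 0)

print(total_elements)   # 4965048684, which is larger than 2**32
print(rows_with_data)   # 7966

# The narrower test array (width 54122) stays below 2**32 elements,
# so the same hypothetical wrap would not affect it.
print(rows * 54122 < 2**32)  # True
```

This does not prove which code path is at fault; it only shows that the observed row count is what a 32-bit wrap of the element count would produce.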

I have written a test script here which should generate datasets similar to those you describe (code copied below for completeness). I cannot reproduce the original issue.

I can create a 59022 x 84122 np.array with dtype=np.uint16, but the command of interest then fails with an out-of-memory error. I am memory limited, so I can't test the exact sizes you give.

However, if I drop the width down to 54122, the code works as expected (doesn't output zeros in rows > 7966).
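The memory limit can be estimated with a little arithmetic. The sketch below assumes 2 bytes per element (np.uint16, as in my test) and that X3[ ind, : ] allocates a full copy alongside the original, which is how fancy indexing behaves. The helper name array_bytes is only for illustration.

```python
def array_bytes(rows, width, itemsize=2):
    """Bytes needed for a dense rows x width array (itemsize=2 for uint16)."""
    return rows * width * itemsize

GIB = 2**30

for width in (84122, 54122):
    one_copy = array_bytes(59022, width)
    # Fancy indexing returns a copy, so the original and the
    # rearranged array are both in memory at the same time.
    peak = 2 * one_copy
    print(width, round(one_copy / GIB, 2), round(peak / GIB, 2))
```

With width 84122 the original plus the copy need well over the roughly 13GB I have available, while with width 54122 they fit.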

My numpy version is

numpy.version.version == '1.8.2'

My python version and system is as follows:

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32

Scripting Code

import numpy as np
import os

# Function to make some test data that will fit in memory...
def makeX(ind,width):
    rowcount = len(ind)
    Xret = np.ones((rowcount,width),dtype=np.uint16)
    col0 = ind.copy()
    col0 = col0.reshape((rowcount,1))
    np.random.shuffle(col0)

    for r in range(len(Xret)):
        Xret[r] = bytearray(os.urandom(width))
        Xret[r, 0] = col0[r, 0]  # first column holds a shuffled row id

    return Xret

X3width = 54122 # if this is 84122, the last line fails with MemoryError on my box 
                # (16GB memory ~13 available)

ind = np.array(range(59022))
X1 = makeX(ind,16)
X3 = makeX(ind, X3width)

print('Shapes of ind, X1 and X3')
print(ind.shape)
print(X1.shape)
print(X3.shape)

print('Contents of ind, X1 and X3')
print(ind)
print(X1)
print(X3)

print('Shape of np.nonzero( np.sum( X3, axis=1 ) )[0]')
print(np.nonzero( np.sum( X3, axis=1 ) )[0].shape)
print('Shape of np.nonzero( np.sum( X3[ ind, : ], axis=1 ) )[0]')
print(np.nonzero( np.sum( X3[ ind, : ], axis=1 ) )[0].shape)

# This outputs (59022,) as expected
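If upgrading numpy is not an option, one possible workaround is to rearrange the rows in blocks so that no single fancy-indexing copy handles a very large number of elements. This is only a sketch: take_rows_in_blocks and max_elements are hypothetical names, and I have not been able to check that it avoids the problem on an affected numpy version.

```python
import numpy as np

def take_rows_in_blocks(X, ind, max_elements=2**30):
    """Return X[ind, :] computed block by block.

    Each block copies at most about max_elements elements (at least one
    row), keeping every individual copy far below 2**31 elements.
    """
    width = X.shape[1]
    rows_per_block = max(1, max_elements // width)
    out = np.empty((len(ind), width), dtype=X.dtype)
    for start in range(0, len(ind), rows_per_block):
        stop = start + rows_per_block
        # Slice assignment into `out`; only this block is fancy-indexed.
        out[start:stop] = X[ind[start:stop]]
    return out

# Small usage example: the result matches plain fancy indexing.
X_small = np.arange(20).reshape(5, 4)
ind_small = np.array([4, 2, 0, 1, 3])
print(np.array_equal(take_rows_in_blocks(X_small, ind_small, max_elements=8),
                     X_small[ind_small]))
```

The peak memory is the same as for X[ ind, : ], since the full output array is still allocated; only the size of each individual copy changes.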

Tags:

python

numpy

marcin_j
