I have a large numpy array (typically of order 500,000x1024 but can be larger) and I'm trying to perform a couple of process that depend on where the positive values in the array are. A very small example array might be
[[ 0., 0., 0., 0., 0.,-1.,-1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 1., 0., 0., 1., 5., 0., 0.],
[ 0., 1., 1., 0., 0., 0., 1., 0., 0.],
[ 0., 3., 1., 0., 0., 2., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 1., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0.]]
The first is to replace any zeros between positive values that are less than three columns apart in each row. so if I replace these numbers with 50, my example output would be
[[ 0., 0., 0., 0., 0.,-1.,-1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 1.,50.,50., 1., 5., 0., 0.],
[ 0., 1., 1., 0., 0., 0., 1., 0., 0.],
[ 0., 3., 1.,50.,50., 2., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 1., 1., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0.]]
The second thing I need to do is to write out some information for each row based on where the ranges of positive values are. For example using my altered array, I need to be able to write out one statement for the third row declaring positive integers for col[1:7] and two statements for the fourth row declaring positive integers in col[1:3] and col[6].
I've managed to utilise the numpy vectorised methods to a point to combat the first task, but have still ended up resorting to looping through both columns and rows (albeit on a subset of the whole array). Otherwise I end up replacing all of the zeros in a given row instead of just those between positive values.
But the second task I can't seem to find a way to do without cycling through the whole array using
for col in arr:
for row in arr:
I guess my overall question would be, is there a way to make use of the vectorised methods in numpy to define column index ranges that will differ for each row and depend on the values in the following column?
Any help would be much appreciated.
Numpy unfortunately can't do much processing without generating more arrays, so I fear any solution will require either some form of manual loop like you've been using, or creating one or more additional big arrays. You may be able to come up with a solution that's quite fast and memory efficient using numexpr.
Here's a stab at doing this in a way that isn't necessarily memory efficient, but at least all the looping will be done by Numpy, so should be a lot faster than what you've been doing as long as it fits in your memory. (Memory efficiency might be improved by rewriting some of this as in-place operations but I won't worry about that.)
Here's your step 1:
positive = x>0 # a boolean array marking the positive values in x
positive0 = positive[:,0:-3] # all but last 3 columns
positive1 = positive[:,1:-2] # all but 1st and last 2 columns; not actually used
positive2 = positive[:,2:-1] # all but first 2 and last 1 columns
positive3 = positive[:,3: ] # all but first 3 columns
# In the following, the suffix 1 indicates that we're viewing things from the perspective
# of entries in positive1 above. So, e.g., has_pos_1_to_left1 will be True at
# any position where an entry in positive1 would be preceded by a positive entry in x
has_pos_1_to_left1 = positive0
has_pos_1_or_2_to_right1 = positive2 | positive3
flanked_by_positives1 = has_pos_1_to_left1 & has_pos_1_or_2_to_right1
zeros = (x == 0) # indicates everywhere x is 0
zeros1 = zeros[:,1:-2] # all but 1st and last 2 columns
x1 = x[:,1:-2] # all but 1st and last 2 columns
x1[zeros1 & flanked_by_positives1] = 50 # fill in zeros that were flanked - overwrites x!
# The preceding didn't address the next to last column, b/c we couldn't
# look two slots to the right of it without causing error. Needs special treatment:
x[:,-2][ zeros[:,-2] & positive[:,-1] & (positive[:,-4] or positive[:,-3])] = 50
And here's your step 2:
filled_positives = x>0 # assuming we just filled in x
diffs = numpy.diff(filled_positives) # will be 1 at first positive in any sequence,
# -1 after last positive, zero elsewhere
endings = numpy.where(diffs==-1) # tuple specifying coords where positive sequences end
# omits final column!!!
beginnings = numpy.where(diffs==1) # tuple specifying coords where pos seqs about to start
# omits column #0!!!
It should be straightforward to use these beginning and ending coordinates to extract the information about each row you said you needed, but remember that this difference-detecting method only catches transitions from non-positive to positive, or vice versa, so it won't mention positive sequences beginning in the zeroth column or ending in the last column, so you'll need to look for those non-transitions separately if you want them.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With