Using np.view() with changes to structured arrays in numpy 1.14

I have a numpy structured array with a mixed dtype (i.e., floats, ints, and strings). I want to select some of the columns of the array (all of which contain only floats) and then get the sum, by column, of the rows, as a standard numpy array. The initial array takes a form comparable to:

import numpy as np

some_data = np.array([('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)],
                     dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])

For this example, I'd like to take the sum of columns A and B, yielding np.array([7.5, 11.15]). With numpy ≤1.13, I could do that as follows:

get_cols = ['A', 'B']
desired_sum = np.sum(some_data[get_cols].view(('<f8', len(get_cols))), axis=0)

With the release of numpy 1.14, this method fails with "ValueError: Changing the dtype to a subarray type is only supported if the total itemsize is unchanged". The error is a result of the changes made in numpy 1.14 to the handling of structured arrays. (User bbengfort commented about the FutureWarning given about this change in this answer.)

In light of these changes to structured arrays, how can I obtain the desired sum from the structured array subset?

trynthink asked Oct 28 '22 20:10


1 Answer

In [165]: some_data = np.array([('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)], dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
     ...:                      
In [166]: get_cols = ['A','B']
In [167]: some_data[get_cols]
Out[167]: 
array([( 3.5,  2.15), ( 2.8,  5.3 ), ( 1.2,  3.7 )],
      dtype=[('A', '<f8'), ('B', '<f8')])

Simply reading the field values is fine. In 1.13, taking a view of the result gives a warning:

In [168]: some_data[get_cols].view(('<f8', len(get_cols)))
/usr/local/bin/ipython3:1: FutureWarning: Numpy has detected that you may be viewing or writing to an array returned by selecting multiple fields in a structured array. 

This code may break in numpy 1.13 because this will return a view instead of a copy -- see release notes for details.
  #!/usr/bin/python3
Out[168]: 
array([[ 3.5 ,  2.15],
       [ 2.8 ,  5.3 ],
       [ 1.2 ,  3.7 ]])

With the recommended copy, there is no warning:

In [169]: some_data[get_cols].copy().view(('<f8', len(get_cols)))
Out[169]: 
array([[ 3.5 ,  2.15],
       [ 2.8 ,  5.3 ],
       [ 1.2 ,  3.7 ]])
In [171]: np.sum(_, axis=0)
Out[171]: array([  7.5 ,  11.15])
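For readers on newer NumPy releases, here is a minimal sketch of two ways to get the same sum without calling view at all. It assumes NumPy 1.16 or later for numpy.lib.recfunctions.structured_to_unstructured. The np.stack variant uses only single-field indexing, so it should not depend on how multi-field indexing behaves in a given version.

```python
import numpy as np
from numpy.lib import recfunctions as rfn  # structured_to_unstructured: assumes NumPy >= 1.16

some_data = np.array(
    [('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)],
    dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
get_cols = ['A', 'B']

# Option 1: convert the selected fields to an ordinary (n_rows, n_cols) float array.
plain = rfn.structured_to_unstructured(some_data[get_cols])
sum_unstructured = plain.sum(axis=0)

# Option 2: pull each field out on its own and stack the results as columns.
stacked = np.stack([some_data[name] for name in get_cols], axis=1)
sum_stacked = stacked.sum(axis=0)

print(sum_unstructured, sum_stacked)
```

Both options produce a regular 2-D float array, so any ordinary reduction (sum, mean, and so on) can follow.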

Your original array has this dtype:

dtype([('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])

A slice containing only A and B leaves the two f8 fields interspersed with the U20 field in memory. Changing the view dtype of such a mix is problematic, which is why working with a copy is more reliable.
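If the selected fields keep their original offsets (as newer NumPy releases do for multi-field indexing, to my understanding), the padding has to be removed before the bytes can be reinterpreted. A sketch using numpy.lib.recfunctions.repack_fields, which I believe is available from NumPy 1.15 or so onward:

```python
import numpy as np
from numpy.lib import recfunctions as rfn  # repack_fields: assumes a reasonably recent NumPy

some_data = np.array(
    [('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)],
    dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
get_cols = ['A', 'B']

# repack_fields returns an array whose dtype holds only the selected fields,
# laid out back to back with no gaps between them.
packed = rfn.repack_fields(some_data[get_cols])

# With 16 contiguous bytes per record, the data can be read as plain f8 values.
as_float = packed.view('<f8').reshape(len(packed), -1)
desired_sum = as_float.sum(axis=0)
print(desired_sum)
```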

Since U20 takes up 4*20 = 80 bytes, the total itemsize is 96 (80 + 8 + 8), a multiple of 8. We can view the whole thing as f8, reshape, and throw away the columns that belong to the U20 field:

In [183]: some_data.view('f8').reshape(3,-1)[:,-2:]
Out[183]: 
array([[ 3.5 ,  2.15],
       [ 2.8 ,  5.3 ],
       [ 1.2 ,  3.7 ]])

It's not very pretty and I don't recommend it, but it may give some insight into how structured data is arranged.
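The layout behind that trick can be checked directly from the dtype. A small sketch that prints the record size and the byte offset of each field:

```python
import numpy as np

dt = np.dtype([('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])

# Bytes per record: 80 for the U20 field plus 8 for each f8 field.
itemsize = dt.itemsize

# dt.fields maps each field name to (dtype, byte offset within the record).
offsets = {name: dt.fields[name][1] for name in dt.names}

print(itemsize)  # 96
print(offsets)   # {'col1': 0, 'A': 80, 'B': 88}
```

A record is 12 f8-sized slots wide, and A and B occupy the last two, which is why the [:, -2:] slice picks them out.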

view on a structured array is useful at times, but often a bit tricky to use correctly.

If the 2 numeric fields are usually used together, I'd recommend a compound dtype like:

In [184]: some_data = np.array([('foo', [3.5, 2.15]), ('bar', [2.8, 5.3]), ('baz', [1.2, 3.7])],
     ...:                      dtype=[('col1', '<U20'), ('AB', '<f8',(2,))])
In [185]: some_data
Out[185]: 
array([('foo', [ 3.5 ,  2.15]), ('bar', [ 2.8 ,  5.3 ]),
       ('baz', [ 1.2 ,  3.7 ])],
      dtype=[('col1', '<U20'), ('AB', '<f8', (2,))])
In [186]: some_data['AB']
Out[186]: 
array([[ 3.5 ,  2.15],
       [ 2.8 ,  5.3 ],
       [ 1.2 ,  3.7 ]])

genfromtxt accepts this style of dtype.
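As a rough illustration of loading into that compound dtype, the sketch below reads three comma-separated rows with np.genfromtxt. The CSV text is made up for the example and stands in for a real file.

```python
import io
import numpy as np

# Hypothetical CSV content: one string column followed by two float columns.
text = io.StringIO("foo,3.5,2.15\nbar,2.8,5.3\nbaz,1.2,3.7\n")

# The two float columns are read into a single field 'AB' of shape (2,).
dt = [('col1', '<U20'), ('AB', '<f8', (2,))]
loaded = np.genfromtxt(text, delimiter=',', dtype=dt)

# 'AB' comes back as an ordinary (n_rows, 2) float array.
ab_sum = loaded['AB'].sum(axis=0)
print(ab_sum)
```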

hpaulj answered Nov 15 '22 05:11
