Changing numpy structured array dtype names and formats

Tags: python, python-3.x, numpy, structured-array

I am doing some work with structured arrays in numpy (that I will eventually convert to a pandas dataframe).

Now, I generate this structured array by reading in some data (actually memmapping some data) and then filtering it by user-specified constraints. I then want to convert this data out of the form that I read it in as (everything is an int to conserve space in the file I read it from) into a more usable format so I can do some unit conversions (i.e. upconvert it to a float).

I noticed an interesting artifact (or something) along the way while changing a structured data type. Say that reading in the data results in the same structured array as is created by the following (note that in the actual code the dtype is much longer and much more complex, but this suffices for an MWE):

import numpy as np

names = ['foo', 'bar']
formats = ['i4', 'i4']

dtype = np.dtype({'names': names, 'formats': formats})

data = np.array([(1, 2), (3, 4)], dtype=dtype)
print(data)
print(data.dtype)

This creates

[(1, 2) (3, 4)]
[('foo', '<i4'), ('bar', '<i4')]

as the structured array.

Now, say I want to upconvert both of these fields to double while also renaming the second component. That seems like it should be easy:

names[1] = 'baz'

# builtin float maps to 'f8'; the np.float alias was removed in NumPy 1.24
formats[0] = float
formats[1] = float

dtype_new = np.dtype({'names': names, 'formats': formats})

data2 = data.copy().astype(dtype_new)

print(data2)
print(data2.dtype)

but the result is unexpected:

[(1.0, 0.0) (3.0, 0.0)]
[('foo', '<f8'), ('baz', '<f8')]

What happened to the data from the second component? We can do this conversion, however, if we split things up:

names[1] = 'bar'  # restore the original name so that only the formats change here
dtype_new3 = np.dtype({'names': names, 'formats': formats})

data3 = data.copy().astype(dtype_new3)

print(data3)
print(data3.dtype)

names[1] = 'baz'
data4 = data3.copy()
data4.dtype.names = names

print(data4)
print(data4.dtype)

which results in the correct output

[(1.0, 2.0) (3.0, 4.0)]
[('foo', '<f8'), ('bar', '<f8')]
[(1.0, 2.0) (3.0, 4.0)]
[('foo', '<f8'), ('baz', '<f8')]

It appears that when astype is called with a structured dtype, numpy matches the names for each component and then applies the specified type to the contents (just guessing here, I didn't look at the source code). Is there any way to do this conversion all at once (i.e. the name change and the upconversion of the format), or does it simply need to be done in steps? (It's not a huge deal if it needs to be done in steps, but it seems odd to me that there's not a single-step way to do this.)

asked Aug 12 '16 14:08

Andrew

2 Answers

There is a library of functions designed to work with recarray (and thus structured arrays). It's kind of hidden, so I'll have to do a search to find it. It has functions for renaming fields, adding and deleting fields, etc. The general pattern of action is to make a new array with the target dtype, and then copy fields one by one. Since an array usually has many elements and a small number of fields, this doesn't slow things down much.
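The library is most likely numpy.lib.recfunctions (an assumption, since it isn't named here). As a minimal sketch, its rename_fields function can do the renaming step after a cast that keeps the field names unchanged, so the result does not depend on how astype pairs fields. The variable names as_float and renamed are only illustrative.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

data = np.array([(1, 2), (3, 4)], dtype=[('foo', 'i4'), ('bar', 'i4')])

# Step 1: change only the formats. The names stay identical, so it does not
# matter whether astype pairs fields by name or by position.
as_float = data.astype([('foo', 'f8'), ('bar', 'f8')])

# Step 2: rename 'bar' to 'baz'. rename_fields takes an {old: new} mapping
# and returns the array viewed with the renamed dtype.
renamed = rfn.rename_fields(as_float, {'bar': 'baz'})

print(renamed)
print(renamed.dtype)
```

The same module also has append_fields and drop_fields for adding and deleting fields.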

It looks like this astype method is using some of that code, or maybe compiled code that behaves the same way.

So yes, it does look like we need to change field dtypes and names in separate steps.

In [1279]: data=np.array([(1,2),(3,4)],dtype='i,i')
In [1280]: data
Out[1280]: 
array([(1, 2), (3, 4)], 
      dtype=[('f0', '<i4'), ('f1', '<i4')])
In [1281]: dataf=data.astype('f8,f8')     # change dtype, same default names
In [1282]: dataf
Out[1282]: 
array([(1.0, 2.0), (3.0, 4.0)], 
      dtype=[('f0', '<f8'), ('f1', '<f8')])

Easy name change:

In [1284]: dataf.dtype.names=['one','two'] 
In [1285]: dataf
Out[1285]: 
array([(1.0, 2.0), (3.0, 4.0)], 
      dtype=[('one', '<f8'), ('two', '<f8')])

In [1286]: data.astype(dataf.dtype)
Out[1286]: 
array([(0.0, 0.0), (0.0, 0.0)], 
      dtype=[('one', '<f8'), ('two', '<f8')])

When no names match, astype produces an all-zero array, the same as np.zeros(data.shape, dataf.dtype). Because it matches names rather than position in the dtype, I can reorder values and even add fields.

In [1291]: data.astype([('f1','f8'),('f0','f'),('f3','i')])
Out[1291]: 
array([(2.0, 1.0, 0), (4.0, 3.0, 0)], 
      dtype=[('f1', '<f8'), ('f0', '<f4'), ('f3', '<i4')])
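The "make a new array with the target dtype, then copy fields one by one" pattern described earlier can also be written out by hand, which changes names and formats in a single call without relying on how astype matches fields. This is only a sketch: convert_fields and its (old_name, new_name, new_format) spec are hypothetical names made up for illustration, not part of numpy.

```python
import numpy as np

def convert_fields(arr, spec):
    """Copy fields into a new structured array, renaming and casting each one.

    spec is a list of (old_name, new_name, new_format) tuples, given in the
    order the fields should appear in the output. (Hypothetical helper.)
    """
    new_dtype = np.dtype([(new, fmt) for _, new, fmt in spec])
    out = np.empty(arr.shape, dtype=new_dtype)
    for old, new, _ in spec:
        # Assigning to a field casts the source values to the target format.
        out[new] = arr[old]
    return out

data = np.array([(1, 2), (3, 4)], dtype=[('foo', 'i4'), ('bar', 'i4')])
data2 = convert_fields(data, [('foo', 'foo', 'f8'), ('bar', 'baz', 'f8')])

print(data2)
print(data2.dtype)
```

The loop runs once per field, not once per element, so the cost stays small for arrays with many rows and few fields.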

answered Sep 30 '22 07:09

hpaulj

This now seems to work as expected on recent numpy versions:

names[1] = 'baz'

formats[0] = float
formats[1] = float

dtype_new = np.dtype({'names': names, 'formats': formats})

data2 = data.copy().astype(dtype_new)

print(data2)
print(data2.dtype)

results in

[(1., 2.) (3., 4.)]
[('foo', '<f8'), ('baz', '<f8')]

It seems like this has to do with a change in numpy to match structured array fields by position instead of by name when doing operations (see numpy PR#6053: “MAINT: struct assignment "by field position", multi-field indices return views”). A relevant bug report for this question seems to be issue #7058: “astype converts numpy array values to 0.0 for structured dtype”.

If this is indeed the relevant change, then the numpy release to fix/implement this should be v1.14.0, see the release notes for numpy 1.14.0: “Changes – Multiple-field indexing/assignment of structured arrays”.
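If the code may run on both older and newer numpy versions, one option is to do the one-step astype and then confirm that the fields really were paired by position. The following is a rough sketch under that assumption; astype_checked is a hypothetical helper name, and the check compares values exactly, so it is meant for integer sources like the one in the question.

```python
import numpy as np

def astype_checked(arr, new_dtype):
    """Cast a structured array, then verify fields were paired by position.

    Hypothetical helper: raises if any output field differs from the source
    field at the same position, as happens when fields are matched by name.
    """
    new_dtype = np.dtype(new_dtype)
    out = arr.astype(new_dtype)
    for old, new in zip(arr.dtype.names, new_dtype.names):
        expected = arr[old].astype(new_dtype[new])
        if not np.array_equal(out[new], expected):
            raise RuntimeError(
                f"field {new!r} does not match source field {old!r}; "
                "this numpy version may be matching fields by name"
            )
    return out

data = np.array([(1, 2), (3, 4)], dtype=[('foo', 'i4'), ('bar', 'i4')])
data2 = astype_checked(data, [('foo', 'f8'), ('baz', 'f8')])

print(data2)
print(data2.dtype)
```

On a numpy that matches by position the call simply returns the converted array; on one that matches by name it fails loudly instead of silently returning zeros.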

answered Sep 30 '22 05:09

Socob
