Normalize/Standardize a numpy recarray

Tags: python, numpy, scipy, normalize, recarray

I wonder what the best way of normalizing/standardizing a numpy recarray is. To make it clear, I'm not talking about a mathematical matrix, but a record array that also has e.g. textual columns (such as labels).

import numpy as np

a = np.genfromtxt("iris.csv", delimiter=",", dtype=None)
print(a.shape)
# (150,)

As you can see, I cannot simply slice a[:, :-1], because the array is one-dimensional.

The best I found is to iterate over all columns:

for nam in a.dtype.names[:-1]:
    col = a[nam]
    a[nam] = (col - col.min()) / (col.max() - col.min())

Is there a more elegant way of doing this? Is there some method such as "normalize" or "standardize" somewhere?


asked Mar 19 '12 18:03

Has QUIT--Anony-Mousse

2 Answers

There are a number of ways to do it, but some are cleaner than others.

Usually, in numpy, you keep the string data in a separate array.

(Things are a bit more low-level than, say, R's data frame. You typically just wrap things up in a class for the association, but keep different data types separate.)
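As a rough illustration of that "wrap it in a class" idea, here is a minimal sketch. The class name `LabeledData`, its attributes, and the toy values are all hypothetical and only meant to show numeric data and labels living in separate arrays:

```python
import numpy as np


class LabeledData:
    """Hypothetical container: numeric values and text labels in separate arrays."""

    def __init__(self, values, labels):
        self.values = np.asarray(values, dtype=float)  # shape (n_rows, n_cols)
        self.labels = np.asarray(labels)               # shape (n_rows,)
        if self.values.shape[0] != self.labels.shape[0]:
            raise ValueError("values and labels must have the same number of rows")

    def min_max_normalized(self):
        # Column-wise min-max scaling; the labels are carried along untouched.
        lo = self.values.min(axis=0)
        span = self.values.max(axis=0) - lo
        return LabeledData((self.values - lo) / span, self.labels)


# Toy rows standing in for a few lines of iris.csv (made-up sample values).
ds = LabeledData(
    [[5.1, 3.5], [7.0, 3.2], [6.3, 3.3]],
    ["setosa", "versicolor", "virginica"],
)
scaled = ds.min_max_normalized()
```

Because the numeric part is an ordinary 2D float array, the usual `axis=0` reductions work directly and no per-field loop is needed.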

Honestly, numpy isn't optimized for handling "flexible" datatypes such as this (though it can certainly do it). Libraries such as pandas provide a better interface for "spreadsheet-like" data (and pandas is built on top of numpy).

However, structured arrays (which is what you have here) can be sliced column-wise by passing in a list of field names, e.g. data[['col1', 'col2', 'col3']].
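To make the field-based indexing concrete, here is a small sketch on a hand-built structured array. The field names and values are made up for illustration and are not taken from the actual iris.csv:

```python
import numpy as np

# Hypothetical toy data: two numeric fields plus a text label.
data = np.array(
    [(5.1, 3.5, "setosa"), (7.0, 3.2, "versicolor"), (6.3, 3.3, "virginica")],
    dtype=[("sepal_length", "f8"), ("sepal_width", "f8"), ("label", "U12")],
)

# A single field name gives a plain 1D array for that column.
lengths = data["sepal_length"]

# A list of field names gives a structured array with only those fields.
numeric = data[["sepal_length", "sepal_width"]]
```

Note that `numeric` is still one-dimensional: selecting fields narrows the record type, it does not turn the result into a 2D float array.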

At any rate, one way is to do something like this:

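import numpy as np
from numpy.lib import recfunctions as rfn

# np.recfromcsv is deprecated as of NumPy 2.0; genfromtxt with names=True also
# reads the first row as field names (drop names=True if the file has no header).
data = np.genfromtxt('iris.csv', delimiter=',', names=True, dtype=None, encoding=None)

# In this case, it's just all but the last, but we could be more general
# This must be a list and not a tuple, though.
float_fields = list(data.dtype.names[:-1])

float_dat = data[float_fields]

# Now we need a "regular" 2D array. Older NumPy allowed
# float_dat.view(np.float).reshape((data.size, -1)), but np.float is gone and
# multi-field indexing now returns a padded view, so convert explicitly.
float_dat = rfn.structured_to_unstructured(float_dat)

# And we can normalize columns as usual.
normalized = (float_dat - float_dat.min(axis=0)) / np.ptp(float_dat, axis=0)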

However, this is far from ideal: it produces a separate array and leaves the original untouched. If you want to do the operation in place (as you currently are), the easiest solution is what you already have: just iterate over the field names.
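If the loop is needed in more than one place, it could be wrapped in a small helper. The following is only a sketch: `scale_fields` and its `method` argument are hypothetical names, not part of numpy, and the sample array is made up. It covers both min-max normalization and z-score standardization, and it assumes no column is constant (otherwise the division would be by zero):

```python
import numpy as np


def scale_fields(arr, fields, method="minmax"):
    """Rescale the named fields of a structured array in place (hypothetical helper)."""
    for name in fields:
        col = arr[name]
        if method == "minmax":
            # Map the column onto [0, 1].
            arr[name] = (col - col.min()) / (col.max() - col.min())
        elif method == "zscore":
            # Shift to zero mean and scale to unit (population) standard deviation.
            arr[name] = (col - col.mean()) / col.std()
        else:
            raise ValueError("unknown method: %r" % (method,))


dtype = [("sepal_length", "f8"), ("sepal_width", "f8"), ("label", "U12")]
rows = [(5.1, 3.5, "setosa"), (7.0, 3.2, "versicolor"), (6.3, 3.3, "virginica")]

normalized = np.array(rows, dtype=dtype)
scale_fields(normalized, normalized.dtype.names[:-1])

standardized = np.array(rows, dtype=dtype)
scale_fields(standardized, standardized.dtype.names[:-1], method="zscore")
```

The text column is never touched because only the fields passed in are rescaled.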

Incidentally, using pandas, you'd do something like this:

import pandas
data = pandas.read_csv('iris.csv', header=None)

float_dat = data[data.columns[:-1]]
dmin, dmax = float_dat.min(axis=0), float_dat.max(axis=0)

data[data.columns[:-1]] = (float_dat - dmin) / (dmax - dmin)


answered Oct 14 '22 08:10

Joe Kington

What version of NumPy are you using? With version 1.5.1, I don't get this behavior. I made a short text file as an example, saved as test.txt:

last,first,country,state,zip
tyson,mike,USA,Nevada,89146
brady,tom,USA,Massachusetts,02035

When I then execute the following code, this is what I get:

>>> import numpy as np
>>> a = np.genfromtxt("/home/ely/Desktop/Python/test.txt",delimiter=',',dtype=None)
>>> print a.shape
(3,5)
>>> print a
[['last' 'first' 'country' 'state' 'zip']
 ['tyson' 'mike' 'USA' 'Nevada' '89146']
 ['brady' 'tom' 'USA' 'Massachusetts' '02035']]
>>> print a[0,:-1]
['last' 'first' 'country' 'state']
>>> print a.dtype.names
None

I'm just wondering what's different about your data.
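One likely explanation, offered as a guess rather than a confirmed diagnosis: with dtype=None, genfromtxt infers a type per column. When every column ends up with the same type (here, all strings, since the header row is read as data), the result is a regular 2D array. When the columns have different types, such as floats plus a text label, the result is a one-dimensional structured array. The sketch below uses made-up in-memory CSV text instead of real files:

```python
import io

import numpy as np

# Hypothetical sample data: one file with only text, one with floats and a label.
all_text = "tyson,mike,USA\nbrady,tom,USA\n"
mixed = "5.1,3.5,setosa\n7.0,3.2,versicolor\n"

a = np.genfromtxt(io.StringIO(all_text), delimiter=",", dtype=None, encoding=None)
b = np.genfromtxt(io.StringIO(mixed), delimiter=",", dtype=None, encoding=None)

print(a.shape, a.dtype.names)  # homogeneous columns: 2D array, no field names
print(b.shape, b.dtype.names)  # mixed columns: 1D structured array with fields
```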

answered Oct 14 '22 06:10

ely
