
Normalize/Standardize a numpy recarray

I wonder what the best way of normalizing/standardizing a numpy recarray is. To make it clear, I'm not talking about a mathematical matrix, but a record array that also has e.g. textual columns (such as labels).

import numpy as np

a = np.genfromtxt("iris.csv", delimiter=",", dtype=None)
print(a.shape)
# (150,)

As you can see, I cannot process e.g. a[:, :-1], because the array is one-dimensional.

The best I found is to iterate over all columns:

for nam in a.dtype.names[:-1]:
    col = a[nam]
    a[nam] = (col - col.min()) / (col.max() - col.min())

Any more elegant way of doing this? Is there some method such as "normalize" or "standardize" somewhere?

Has QUIT--Anony-Mousse asked Mar 19 '12 18:03




2 Answers

There are a number of ways to do it, but some are cleaner than others.

Usually, in numpy, you keep the string data in a separate array.

(Things are a bit more low-level than, say, R's data frame. You typically just wrap things up in a class for the association, but keep different data types separate.)
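As a rough sketch of that layout (the class name, field names, and sample values below are made up for illustration, not taken from the question), the numeric columns can live in a plain 2D float array with the labels in a parallel array:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class LabeledData:
    """Hypothetical container: numeric features and their labels, kept separate."""
    features: np.ndarray  # shape (n_samples, n_features), float
    labels: np.ndarray    # shape (n_samples,), strings

    def min_max_scaled(self) -> "LabeledData":
        # Scale every column to [0, 1]; the labels are carried along unchanged.
        lo = self.features.min(axis=0)
        span = self.features.max(axis=0) - lo
        return LabeledData((self.features - lo) / span, self.labels)


# Made-up sample values, not read from iris.csv.
sample = LabeledData(
    features=np.array([[5.0, 3.0], [6.0, 2.0], [7.0, 4.0]]),
    labels=np.array(["setosa", "versicolor", "virginica"]),
)
scaled = sample.min_max_scaled()
print(scaled.features)
```

Note that a constant column would make the span zero and produce a division by zero, so real data may need a guard for that case.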

Honestly, numpy isn't optimized for handling "flexible" datatypes such as this (though it can certainly do it). Things like pandas provide a better interface for "spreadsheet-like" data (and pandas is just a layer on top of numpy).

However, structured arrays (which is what you have here) will allow you to slice them column-wise when you pass in a list of field names. (e.g. data[['col1', 'col2', 'col3']])
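To make the difference between single-field and multi-field indexing concrete, here is a small sketch on a hand-built structured array (the field names and values are assumptions standing in for the CSV data):

```python
import numpy as np

# A tiny hand-built structured array standing in for the CSV data.
data = np.array(
    [(5.1, 3.5, "setosa"), (4.9, 3.0, "versicolor")],
    dtype=[("sepal_length", "f8"), ("sepal_width", "f8"), ("species", "U10")],
)

one_column = data["sepal_length"]                    # plain 1D float array
two_columns = data[["sepal_length", "sepal_width"]]  # still a structured array

print(one_column.shape, one_column.dtype)
print(two_columns.shape, two_columns.dtype.names)
```

The multi-field selection is still one-dimensional with named fields, which is why it has to be converted before it behaves like an ordinary 2D float array.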

At any rate, one way is to do something like this:

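import numpy as np
from numpy.lib import recfunctions as rfn

data = np.genfromtxt('iris.csv', delimiter=',', dtype=None, encoding=None)

# In this case, it's just all but the last, but we could be more general
# This must be a list and not a tuple, though.
float_fields = list(data.dtype.names[:-1])

float_dat = data[float_fields]

# Now we just need to convert it to a "regular" 2D array...
# (A multi-field selection keeps the original record layout, so a plain
# .view(float) is not reliable; structured_to_unstructured makes a copy.)
float_dat = rfn.structured_to_unstructured(float_dat, dtype=float)

# And we can normalize columns as usual.
normalized = (float_dat - float_dat.min(axis=0)) / np.ptp(float_dat, axis=0)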

However, this is far from ideal. If you want to do the operation in-place (as you currently are) the easiest solution is what you already have: Just iterate over the field names.
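For the in-place route, the loop over field names can be wrapped in a small function. This is only a sketch: the helper name, its `method` parameter, and the sample array are hypothetical, and it assumes the selected fields already have a float dtype (integer fields would be truncated on assignment).

```python
import numpy as np


def rescale_fields_inplace(arr, fields, method="minmax"):
    """Hypothetical helper: rescale the named float fields of `arr` in place."""
    for name in fields:
        col = arr[name]  # a view into arr, so assignments write through
        if method == "minmax":
            col[:] = (col - col.min()) / (col.max() - col.min())
        elif method == "zscore":
            col[:] = (col - col.mean()) / col.std()
        else:
            raise ValueError(f"unknown method: {method!r}")


# Made-up data with two float fields and one text field.
data = np.array(
    [(1.0, 10.0, "a"), (2.0, 20.0, "b"), (3.0, 60.0, "c")],
    dtype=[("x", "f8"), ("y", "f8"), ("label", "U1")],
)
rescale_fields_inplace(data, data.dtype.names[:-1])
print(data)
```

Passing `method="zscore"` gives standardization (zero mean, unit standard deviation) instead of min-max scaling.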

Incidentally, using pandas, you'd do something like this:

import pandas
data = pandas.read_csv('iris.csv', header=None)

float_dat = data[data.columns[:-1]]
dmin, dmax = float_dat.min(axis=0), float_dat.max(axis=0)

data[data.columns[:-1]] = (float_dat - dmin) / (dmax - dmin)
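If the numeric columns are not simply "all but the last", one possible variation is to pick them by dtype with `select_dtypes`. The frame below uses made-up rows rather than the real iris.csv:

```python
import pandas as pd

# Made-up rows standing in for iris.csv (integer column labels, as with header=None).
df = pd.DataFrame(
    {0: [5.0, 6.0, 7.0], 1: [3.0, 2.0, 4.0], 2: ["setosa", "versicolor", "virginica"]}
)

# Select every numeric column, wherever it sits in the frame.
numeric_cols = df.select_dtypes(include="number").columns
numeric = df[numeric_cols]

# Min-max scale the numeric columns; the text column is left untouched.
df[numeric_cols] = (numeric - numeric.min()) / (numeric.max() - numeric.min())
print(df)
```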
Joe Kington answered Oct 14 '22 08:10



What version of NumPy are you using? With version 1.5.1, I don't get this behavior. I made a short text file as an example, saved as test.txt:

last,first,country,state,zip
tyson,mike,USA,Nevada,89146
brady,tom,USA,Massachusetts,02035

When I then execute the following code, this is what I get:

>>> import numpy as np
>>> a = np.genfromtxt("/home/ely/Desktop/Python/test.txt", delimiter=',', dtype=None)
>>> print(a.shape)
(3, 5)
>>> print(a)
[['last' 'first' 'country' 'state' 'zip']
 ['tyson' 'mike' 'USA' 'Nevada' '89146']
 ['brady' 'tom' 'USA' 'Massachusetts' '02035']]
>>> print(a[0, :-1])
['last' 'first' 'country' 'state']
>>> print(a.dtype.names)
None

I'm just wondering what's different about your data.
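One plausible explanation, offered as an observation rather than something either poster confirmed: with dtype=None, genfromtxt returns an ordinary 2D array when every column ends up with the same type (all text in test.txt), but a 1D structured array when the column types differ (floats plus a text label, as in iris.csv). A minimal sketch with made-up inline data:

```python
import io

import numpy as np

# Every column is text, as in test.txt above.
all_text = "tyson,mike,USA\nbrady,tom,USA\n"
# Two numeric columns plus a text label, as in iris.csv.
mixed = "5.1,3.5,setosa\n4.9,3.0,versicolor\n"

a = np.genfromtxt(io.StringIO(all_text), delimiter=",", dtype=None, encoding="utf-8")
b = np.genfromtxt(io.StringIO(mixed), delimiter=",", dtype=None, encoding="utf-8")

print(a.shape, a.dtype.names)  # 2D array, no field names
print(b.shape, b.dtype.names)  # 1D structured array with auto-generated field names
```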

ely answered Oct 14 '22 06:10
