I wonder what the best way of normalizing/standardizing a numpy recarray is.
To make it clear, I'm not talking about a mathematical matrix, but a record array that also has e.g. textual columns (such as labels).
a = np.genfromtxt("iris.csv", delimiter=",", dtype=None)
print a.shape
> (150,)
As you can see, I cannot e.g. process a[:,:-1], as the shape is one-dimensional.
The best approach I found is to iterate over all columns:
for nam in a.dtype.names[:-1]:
    col = a[nam]
    a[nam] = (col - col.min()) / (col.max() - col.min())
Any more elegant way of doing this? Is there some method such as "normalize" or "standardize" somewhere?
There are a number of ways to do it, but some are cleaner than others.
Usually, in numpy, you keep the string data in a separate array.
(Things are a bit more low-level than, say, R's data frame. You typically just wrap things up in a class for the association, but keep different data types separate.)
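For example, one common pattern (this sketch assumes the usual iris layout: four float columns followed by a label column) is to read the numeric data and the labels separately:
import numpy as np

# Numeric columns go into a plain 2D float array...
X = np.genfromtxt("iris.csv", delimiter=",", usecols=(0, 1, 2, 3))
# ...and the labels live in their own 1D string array
# (dtype=None lets genfromtxt infer a string type for this column).
labels = np.genfromtxt("iris.csv", delimiter=",", usecols=(4,), dtype=None)

print(X.shape)       # (150, 4)
print(labels.shape)  # (150,)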
Honestly, numpy isn't optimized for handling "flexible" datatypes such as this (though it can certainly do it). Things like pandas provide a better interface for "spreadsheet-like" data (and pandas is just a layer on top of numpy).
However, structured arrays (which is what you have here) will allow you to slice them column-wise when you pass in a list of field names (e.g. data[['col1', 'col2', 'col3']]).
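As a quick illustration (a toy structured array, not the iris file):
import numpy as np

# A small structured array: two float fields plus a string label field
arr = np.array([(1.0, 2.0, 'a'), (3.0, 4.0, 'b')],
               dtype=[('x', 'f8'), ('y', 'f8'), ('label', 'S1')])

# Indexing with a list of field names pulls out several columns at once
sub = arr[['x', 'y']]
print(sub.dtype.names)  # ('x', 'y')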
At any rate, one way is to do something like this:
import numpy as np
data = np.recfromcsv('iris.csv')
# In this case, it's just all but the last, but we could be more general
# This must be a list and not a tuple, though.
float_fields = list(data.dtype.names[:-1])
float_dat = data[float_fields]
# Now we just need to view it as a "regular" 2D array...
float_dat = float_dat.view(np.float64).reshape((data.size, -1))
# And we can normalize columns as usual.
normalized = (float_dat - float_dat.min(axis=0)) / float_dat.ptp(axis=0)
However, this is far from ideal. If you want to do the operation in-place (as you currently are) the easiest solution is what you already have: Just iterate over the field names.
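As a side note (not part of the original trick above): if your NumPy is 1.16 or newer, np.lib.recfunctions.structured_to_unstructured does the view-and-reshape step for you. A sketch, under the same iris assumptions:
import numpy as np
from numpy.lib import recfunctions as rfn

data = np.recfromcsv('iris.csv')
float_fields = list(data.dtype.names[:-1])

# Turns the selected fields into a plain (n, k) float array
float_dat = rfn.structured_to_unstructured(data[float_fields])
normalized = (float_dat - float_dat.min(axis=0)) / np.ptp(float_dat, axis=0)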
Incidentally, using pandas, you'd do something like this:
import pandas
data = pandas.read_csv('iris.csv', header=None)
float_dat = data[data.columns[:-1]]
dmin, dmax = float_dat.min(axis=0), float_dat.max(axis=0)
data[data.columns[:-1]] = (float_dat - dmin) / (dmax - dmin)
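And if you want standardization (zero mean, unit variance) rather than min-max scaling, the pandas version is the same pattern, e.g.:
data[data.columns[:-1]] = (float_dat - float_dat.mean(axis=0)) / float_dat.std(axis=0)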
What version of NumPy are you using? With version 1.5.1, I don't get this behavior. I made a short text file as an example, saved as test.txt:
last,first,country,state,zip
tyson,mike,USA,Nevada,89146
brady,tom,USA,Massachusetts,02035
When I then execute the following code, this is what I get:
>>> import numpy as np
>>> a = np.genfromtxt("/home/ely/Desktop/Python/test.txt",delimiter=',',dtype=None)
>>> print a.shape
(3, 5)
>>> print a
[['last' 'first' 'country' 'state' 'zip']
['tyson' 'mike' 'USA' 'Nevada' '89146']
['brady' 'tom' 'USA' 'Massachusetts' '02035']]
>>> print a[0,:-1]
['last' 'first' 'country' 'state']
>>> print a.dtype.names
None
I'm just wondering what's different about your data.
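For what it's worth, the usual culprit is mixed column types: when dtype=None sees a mix of numbers and strings, genfromtxt infers a structured dtype and returns a 1-D array with one record per row, while an all-string file like the one above comes back as a plain 2-D array. A quick demo (the inline StringIO data is just for illustration):
import numpy as np
from io import StringIO

mixed = StringIO(u"1.0,2.0,one\n3.0,4.0,two")
a = np.genfromtxt(mixed, delimiter=",", dtype=None)
print(a.shape)        # (2,)  -- one record per row, not (2, 3)
print(a.dtype.names)  # ('f0', 'f1', 'f2')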