
Convert structured array to numpy array for use with Scikit-Learn

I'm having difficulty converting a structured array, loaded from a CSV with np.genfromtxt, into a regular np.array so that I can fit a Scikit-Learn estimator. At some point the structured array gets cast to a regular array, which raises ValueError: can't cast from structure to non-structure. For a long time I used .view to perform the conversion, but this now produces a number of deprecation warnings from NumPy. The code is as follows:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

data = np.genfromtxt(path, dtype=float, delimiter=',', names=True)

target = "occupancy"
features = [
    "temperature", "relative_humidity", "light", "C02", "humidity"
]

# Doesn't work directly
X = data[features]
y = data[target].astype(int)

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X, y)

The exception being raised is: ValueError: Can't cast from structure to non-structure, except if the structure only has a single field.

My second attempt was to use a view as follows:

# View is raising deprecation warnings
X = data[features]
X = X.view((float, len(X.dtype.names)))
y = data[target].astype(int)

Which works and does exactly what I want it to do (I don't need a copy of the data), but results in deprecation warnings:

FutureWarning: Numpy has detected that you may be viewing or writing to 
an array returned by selecting multiple fields in a structured array.

This code may break in numpy 1.15 because this will return a view 
instead of a copy -- see release notes for details.

At the moment we're using tolist() to convert the structured array to a list and then to an np.array. This works, but it seems terribly inefficient:

# Current method (efficient?)
X = np.array(data[features].tolist())
y = data[target].astype(int)

There has to be a better way, I'd appreciate any advice.

NOTE: The data for this example is from the UCI ML Occupancy Repository and the data appears as follows:

array([(nan, 23.18, 27.272 , 426.  ,  721.25, 0.00479299, 1.),
       (nan, 23.15, 27.2675, 429.5 ,  714.  , 0.00478344, 1.),
       (nan, 23.15, 27.245 , 426.  ,  713.5 , 0.00477946, 1.), ...,
       (nan, 20.89, 27.745 , 423.5 , 1521.5 , 0.00423682, 1.),
       (nan, 20.89, 28.0225, 418.75, 1632.  , 0.00427949, 1.),
       (nan, 21.  , 28.1   , 409.  , 1864.  , 0.00432073, 1.)],
      dtype=[('datetime', '<f8'), ('temperature', '<f8'), ('relative_humidity', '<f8'), 
             ('light', '<f8'), ('C02', '<f8'), ('humidity', '<f8'), ('occupancy', '<f8')])
bbengfort asked Mar 03 '18 14:03


2 Answers

You could avoid the need for copying if you can read the data into a plain NumPy array first (by omitting the names parameter):

data = np.genfromtxt(path, dtype=float, delimiter=',', skip_header=1)

Then (lucky for us), X is composed of all but the first and last columns (i.e. omitting the datetime and occupancy columns). So we can express X and y as slices:

X = data[:, 1:-1]
y = data[:, -1].astype(int)

Then we can pass these to scikit-learn functions easily:

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X, y)

and, if we wish, we can view the plain NumPy array as a structured array afterwards:

features = ["temperature", "relative_humidity", "light", "C02", "humidity"]
X = X.ravel().view([(field, X.dtype.type) for field in features])

Unfortunately, this workaround relies on X being expressible as a slice: we wouldn't be able to avoid copying if, for instance, occupancy showed up in between the other feature columns. It also means you have to define X using X = data[:, 1:-1] instead of the more human-readable X = data[features].

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

data = np.genfromtxt(path, dtype=float, delimiter=',', skip_header=1)

X = data[:, 1:-1]
y = data[:, -1].astype(int)

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X, y)

features = ["temperature", "relative_humidity", "light", "C02", "humidity"]
X = X.ravel().view([(field, X.dtype.type) for field in features])
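As a rough check of the slicing limitation described above, the sketch below compares a basic slice with an integer-list column selection on a small made-up array (the values and the column choice are assumptions for illustration, not the occupancy data). np.shares_memory reports whether the result still points at the original buffer:

```python
import numpy as np

# Made-up 3x4 array standing in for the loaded CSV data.
data = np.arange(12, dtype=float).reshape(3, 4)

# A basic slice returns a view onto the same memory.
X_slice = data[:, 1:-1]

# Selecting non-adjacent columns with a list triggers fancy indexing,
# which always returns a copy.
X_fancy = data[:, [0, 2]]

print(np.shares_memory(data, X_slice))  # True
print(np.shares_memory(data, X_fancy))  # False
```

So the no-copy property holds only while the feature columns can be written as a single slice.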

If you must start with the structured array, then hpaulj's answer shows how to view/reshape/slice the structured array to obtain a plain array without copying:

import numpy as np
nan = np.nan
data = np.array([(nan, 23.18, 27.272 , 426.  ,  721.25, 0.00479299, 1.),
       (nan, 23.15, 27.2675, 429.5 ,  714.  , 0.00478344, 1.),
       (nan, 23.15, 27.245 , 426.  ,  713.5 , 0.00477946, 1.), 
       (nan, 20.89, 27.745 , 423.5 , 1521.5 , 0.00423682, 1.),
       (nan, 20.89, 28.0225, 418.75, 1632.  , 0.00427949, 1.),
       (nan, 21.  , 28.1   , 409.  , 1864.  , 0.00432073, 1.)],
      dtype=[('datetime', '<f8'), ('temperature', '<f8'), ('relative_humidity', '<f8'), 
             ('light', '<f8'), ('C02', '<f8'), ('humidity', '<f8'), ('occupancy', '<f8')])

target = 'occupancy'
nrows = len(data)
X = data.view('<f8').reshape(nrows, -1)[:, 1:-1]
y = data[target].astype(int)

This takes advantage of the fact that every field has the same dtype, <f8 (8 bytes), so the structured array can be viewed directly as a plain array of dtype <f8. Reshaping makes it a 2D array with the same number of rows. Slicing removes the datetime and occupancy columns from the array.
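The view/reshape/slice steps can be sanity-checked with a small sketch. The four-field array below is a made-up stand-in for the occupancy data, and the explicit dtype check is an added precaution rather than part of the original answer:

```python
import numpy as np

# Made-up structured array; field names mirror a subset of the real data.
dtype = [("datetime", "<f8"), ("temperature", "<f8"),
         ("light", "<f8"), ("occupancy", "<f8")]
data = np.array(
    [(np.nan, 23.18, 426.0, 1.0),
     (np.nan, 23.15, 429.5, 0.0),
     (np.nan, 20.89, 423.5, 1.0)],
    dtype=dtype,
)

# The trick is only valid when every field shares one dtype.
field_dtypes = {data.dtype[name] for name in data.dtype.names}
assert field_dtypes == {np.dtype("<f8")}

# View as plain floats, restore the row structure, drop first/last columns.
X = data.view("<f8").reshape(len(data), -1)[:, 1:-1]
y = data["occupancy"].astype(int)

print(X.shape)                    # (3, 2)
print(np.shares_memory(X, data))  # True: no copy was made
```

If the fields had mixed dtypes (for example an integer target next to float features), the plain view would reinterpret bytes incorrectly, which is why the check comes first.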

unutbu answered Oct 21 '22 22:10


Add a .copy() to data[features]:

X = data[features].copy()
X = X.view((float, len(X.dtype.names)))

and the FutureWarning message is gone.

This should be more efficient than converting to a list first.
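On newer NumPy releases, another option worth trying is numpy.lib.recfunctions.structured_to_unstructured. The sketch below assumes NumPy 1.16 or later, where that helper is available, and uses a made-up array rather than the real occupancy data. Whether the result is a view or a copy depends on the memory layout, so treat it as a convenience rather than a guaranteed no-copy path:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Made-up structured array; field names mirror a subset of the real data.
dtype = [("datetime", "<f8"), ("temperature", "<f8"),
         ("light", "<f8"), ("occupancy", "<f8")]
data = np.array(
    [(np.nan, 23.18, 426.0, 1.0),
     (np.nan, 23.15, 429.5, 0.0),
     (np.nan, 20.89, 423.5, 1.0)],
    dtype=dtype,
)

features = ["temperature", "light"]

# Convert the selected fields into a regular 2D float array.
X = rfn.structured_to_unstructured(data[features])
y = data["occupancy"].astype(int)

print(X.shape)  # (3, 2)
print(X.dtype)  # float64
```

This keeps the readable data[features] selection and hands Scikit-Learn an ordinary 2D array.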

Mike Müller answered Oct 22 '22 00:10