I'm having difficulty converting a structured array loaded from a CSV using np.genfromtxt into a np.array in order to fit the data to a Scikit-Learn estimator. The problem is that at some point a cast from the structured array to a regular array occurs, resulting in a ValueError: can't cast from structure to non-structure. For a long time, I had been using .view to perform the conversion, but this has resulted in a number of deprecation warnings from NumPy. The code is as follows:
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
data = np.genfromtxt(path, dtype=float, delimiter=',', names=True)
target = "occupancy"
features = [
"temperature", "relative_humidity", "light", "C02", "humidity"
]
# Doesn't work directly
X = data[features]
y = data[target].astype(int)
clf = GradientBoostingClassifier(random_state=42)
clf.fit(X, y)
The exception being raised is: ValueError: Can't cast from structure to non-structure, except if the structure only has a single field.
My second attempt was to use a view as follows:
# View is raising deprecation warnings
X = data[features]
X = X.view((float, len(X.dtype.names)))
y = data[target].astype(int)
Which works and does exactly what I want it to do (I don't need a copy of the data), but results in deprecation warnings:
FutureWarning: Numpy has detected that you may be viewing or writing to
an array returned by selecting multiple fields in a structured array.
This code may break in numpy 1.15 because this will return a view
instead of a copy -- see release notes for details.
At the moment we're using tolist() to convert the structured array to a list and then to a np.array. This works, however it seems terribly inefficient:
# Current method (efficient?)
X = np.array(data[features].tolist())
y = data[target].astype(int)
There has to be a better way; I'd appreciate any advice.
NOTE: The data for this example is from the UCI ML Occupancy Repository and the data appears as follows:
array([(nan, 23.18, 27.272 , 426. , 721.25, 0.00479299, 1.),
(nan, 23.15, 27.2675, 429.5 , 714. , 0.00478344, 1.),
(nan, 23.15, 27.245 , 426. , 713.5 , 0.00477946, 1.), ...,
(nan, 20.89, 27.745 , 423.5 , 1521.5 , 0.00423682, 1.),
(nan, 20.89, 28.0225, 418.75, 1632. , 0.00427949, 1.),
(nan, 21. , 28.1 , 409. , 1864. , 0.00432073, 1.)],
dtype=[('datetime', '<f8'), ('temperature', '<f8'), ('relative_humidity', '<f8'),
('light', '<f8'), ('C02', '<f8'), ('humidity', '<f8'), ('occupancy', '<f8')])
You could avoid the need for copying if you can read the data into a plain NumPy array first (by omitting the names parameter):
data = np.genfromtxt(path, dtype=float, delimiter=',', skip_header=1)
Then (lucky for us), X is composed of all but the first and last columns (i.e. omitting the datetime and occupancy columns). So we can express X and y as slices:
X = data[:, 1:-1]
y = data[:, -1].astype(int)
Then we can pass these to scikit-learn functions easily:
clf = GradientBoostingClassifier(random_state=42)
clf.fit(X, y)
and, if we wish, we can view the plain NumPy array as a structured array afterwards:
features = ["temperature", "relative_humidity", "light", "C02", "humidity"]
X = X.ravel().view([(field, X.dtype.type) for field in features])
Unfortunately, this workaround relies on X being expressible as a slice: we wouldn't be able to avoid copying if occupancy showed up in between the other feature columns, for instance. It also means you have to define X using X = data[:, 1:-1] instead of the more humanly-understandable X = data[features].
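To illustrate why the slice matters, here is a minimal sketch using a hypothetical 4x5 table (the name `table` and its values are made up for the example, not taken from the occupancy data). It checks with np.shares_memory whether each selection still points at the original buffer.

```python
import numpy as np

# Hypothetical stand-in for the plain 2D array read from the CSV
table = np.arange(20, dtype=float).reshape(4, 5)

# A basic slice of adjacent columns is a view on the same memory
contiguous = table[:, 1:-1]

# Selecting non-adjacent columns by index list forces a copy
scattered = table[:, [1, 3]]

print(np.shares_memory(table, contiguous))
print(np.shares_memory(table, scattered))
```

The first check reports shared memory and the second does not, which is the reason the feature columns need to sit next to each other for this approach to stay copy-free.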
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
data = np.genfromtxt(path, dtype=float, delimiter=',', skip_header=1)
X = data[:, 1:-1]
y = data[:, -1].astype(int)
clf = GradientBoostingClassifier(random_state=42)
clf.fit(X, y)
features = ["temperature", "relative_humidity", "light", "C02", "humidity"]
X = X.ravel().view([(field, X.dtype.type) for field in features])
If you must start with the structured array, then hpaulj's answer shows how to view/reshape/slice the structured array to obtain a plain array without copying:
import numpy as np
nan = np.nan
data = np.array([(nan, 23.18, 27.272 , 426. , 721.25, 0.00479299, 1.),
(nan, 23.15, 27.2675, 429.5 , 714. , 0.00478344, 1.),
(nan, 23.15, 27.245 , 426. , 713.5 , 0.00477946, 1.),
(nan, 20.89, 27.745 , 423.5 , 1521.5 , 0.00423682, 1.),
(nan, 20.89, 28.0225, 418.75, 1632. , 0.00427949, 1.),
(nan, 21. , 28.1 , 409. , 1864. , 0.00432073, 1.)],
dtype=[('datetime', '<f8'), ('temperature', '<f8'), ('relative_humidity', '<f8'),
('light', '<f8'), ('C02', '<f8'), ('humidity', '<f8'), ('occupancy', '<f8')])
target = 'occupancy'
nrows = len(data)
X = data.view('<f8').reshape(nrows, -1)[:, 1:-1]
y = data[target].astype(int)
This takes advantage of the fact that each field is 8 bytes long. So it is easy to convert the structured array to a plain array of dtype <f8. Reshaping makes it a 2D array with the same number of rows. Slicing removes the datetime and occupancy column/fields from the array.
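A quick way to confirm that no copy is made is to write through the plain array and read the value back from the structured one. The sketch below uses a shortened, hypothetical three-field dtype rather than the full occupancy layout, and it only works because every field has the same 8-byte float type.

```python
import numpy as np

# Hypothetical, shortened layout: every field is an 8-byte float
dtype = [('datetime', '<f8'), ('temperature', '<f8'), ('occupancy', '<f8')]
data = np.array([(np.nan, 23.18, 1.0), (np.nan, 23.15, 0.0)], dtype=dtype)

# Reinterpret the records as floats, restore the rows, drop first/last column
X = data.view('<f8').reshape(len(data), -1)[:, 1:-1]

# Writing through X changes the structured array, so X is a view
X[0, 0] = 99.0
print(data['temperature'][0])
print(np.shares_memory(X, data))
```

If the fields had mixed types or sizes, the view to '<f8' would not line up with the field boundaries, so this trick should not be used there.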
Add a .copy() to data[features]:
X = data[features].copy()
X = X.view((float, len(X.dtype.names)))
and the FutureWarning message is gone.
This should be more efficient than converting to a list first.
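On newer NumPy releases (1.16 and later, as far as I know), selecting multiple fields returns a view that keeps the byte offsets of the original record, so the .view trick above may no longer line up. That is an assumption about your NumPy version, not something covered in the answers here. A hedged alternative is numpy.lib.recfunctions.structured_to_unstructured, sketched below with a small made-up array in place of the CSV data.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical, shortened stand-in for the occupancy data
dtype = [('datetime', '<f8'), ('temperature', '<f8'),
         ('light', '<f8'), ('occupancy', '<f8')]
data = np.array([(np.nan, 23.18, 426.0, 1.0),
                 (np.nan, 23.15, 429.5, 0.0)], dtype=dtype)

features = ['temperature', 'light']
target = 'occupancy'

# Convert the selected fields to a plain 2D float array
X = rfn.structured_to_unstructured(data[features])
y = data[target].astype(int)

print(X.shape, X.dtype)
```

Whether the result shares memory with the structured array depends on the field layout, so treat it as possibly a copy. Either way it avoids the round trip through a Python list.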