Prepare scipy.io.loadarff result for scikit-learn

Question

I'm trying to use scikit-learn with .arff files. Consider the following code:

from sklearn.ensemble import RandomForestClassifier
from scipy.io.arff import loadarff

import scipy as sp
import numpy as np

dataset = loadarff(open('iris.arff','r'))
target = np.array(dataset[0]['class'])
train = np.array(dataset[0][['sepallength', 'sepalwidth', 'petallength', 'petalwidth']])
rf = RandomForestClassifier(n_estimators = 20, n_jobs = 8)
rf.fit(train, target)

It returns the following error:

ValueError: need more than 1 value to unpack

I assume this has to do with the fact that train is an array of tuples rather than lists (or arrays?); inspecting sklearn.datasets.load_iris() reveals an array of lists (arrays?) that works successfully with the RandomForestClassifier.

Fred Foo · Accepted Answer

The docs for RandomForestClassifier will tell you that fit takes as its X argument a 2-d array of shape (n_samples, n_features), but what you have is indeed a 1-d array:

>>> target.shape
(150,)
>>> train.shape
(150,)

Surprisingly, the contents of this array aren't tuples but a type I've never encountered before:

>>> train[0]
(5.1, 3.5, 1.4, 0.2)
>>> type(train[0])
<type 'numpy.void'>

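numpy.void is the scalar type NumPy uses for the elements of a structured (record) array, which is why each row prints like a tuple. It responds rather strangely to asarray and astype, but converting to a list of lists and back to an array does the trick: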

>>> X = np.asarray(train.tolist(), dtype=np.float32)
>>> X.shape
(150, 4)
>>> rf.fit(X, target)
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=20, n_jobs=8,
            oob_score=False, random_state=None, verbose=0)
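
For readers without iris.arff at hand, the conversion can be reproduced on a small hand-built structured array. This is only a sketch: the three rows and the "S15" width of the class field are made up for illustration and merely stand in for dataset[0].

```python
import numpy as np

# Hypothetical stand-in for dataset[0]: a structured array using the iris field names.
dataset = np.array(
    [(5.1, 3.5, 1.4, 0.2, b"Iris-setosa"),
     (7.0, 3.2, 4.7, 1.4, b"Iris-versicolor"),
     (6.3, 3.3, 6.0, 2.5, b"Iris-virginica")],
    dtype=[("sepallength", "f8"), ("sepalwidth", "f8"),
           ("petallength", "f8"), ("petalwidth", "f8"), ("class", "S15")],
)

features = ["sepallength", "sepalwidth", "petallength", "petalwidth"]
train = dataset[features]          # still 1-d: one record per sample
print(train.shape, train.dtype.names)

# Round-trip through a list of tuples to get a plain 2-d float array.
X = np.asarray(train.tolist(), dtype=np.float32)

# Nominal values are stored as byte strings; decode them to str labels.
y = dataset["class"].astype(str)
print(X.shape, y)
```

X and y can then be passed to rf.fit in place of train and target.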

kleino · Answer

Something seems to have changed since April: loadarff now returns a tuple of an ndarray and a MetaData object.

with open('training_set.arff','r') as f:
    data, meta = loadarff(f)

print(type(data)) # <class 'numpy.ndarray'> 
print(type(meta)) # <class 'scipy.io.arff.arffread.MetaData'>

More specifically, data seems to be a record array. It can be converted to a normal numpy array with the following snippet:

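from numpy.lib.recfunctions import repack_fields  # NumPy >= 1.16

# np.float was removed in NumPy 1.24; this assumes every feature column is a float64 numeric attribute
train_data = repack_fields(data[meta.names()[:-1]]) # everything but the last column, with padding dropped
train_data = train_data.view(np.float64).reshape(data.shape + (-1,)) # converts the record array to a normal numpy array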
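
On NumPy 1.16 or newer, the same conversion can also be sketched with numpy.lib.recfunctions.structured_to_unstructured, which avoids reinterpreting raw memory by hand. The data array below is a hypothetical stand-in for what loadarff returns (two made-up rows, class field last), and the example assumes every feature column is numeric.

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

# Hypothetical stand-in for the `data` array returned by loadarff.
data = np.array(
    [(5.1, 3.5, 1.4, 0.2, b"Iris-setosa"),
     (7.0, 3.2, 4.7, 1.4, b"Iris-versicolor")],
    dtype=[("sepallength", "f8"), ("sepalwidth", "f8"),
           ("petallength", "f8"), ("petalwidth", "f8"), ("class", "S15")],
)

# Everything but the last field; with real data, meta.names()[:-1] plays this role.
feature_names = list(data.dtype.names[:-1])

# Copies the selected fields into an ordinary (n_samples, n_features) array.
train_data = structured_to_unstructured(data[feature_names])
print(train_data.shape, train_data.dtype)
```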

Tags: python, scipy, scikit-learn

Hugo Sereno Ferreira
