How to get SVMs to play nicely with missing data in scikit-learn?

I am using scikit-learn for some data analysis, and my dataset has some missing values (represented by NA). I load the data in with genfromtxt with dtype='f8' and go about training my classifier.

The classification is fine on RandomForestClassifier and GradientBoostingClassifier objects, but using SVC from sklearn.svm causes the following error:

    probas = classifiers[i].fit(train[traincv], target[traincv]).predict_proba(train[testcv])
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 409, in predict_proba
    X = self._validate_for_predict(X)
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 534, in _validate_for_predict
    X = atleast2d_or_csr(X, dtype=np.float64, order="C")
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 84, in atleast2d_or_csr
    assert_all_finite(X)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 20, in assert_all_finite
    raise ValueError("array contains NaN or infinity")
ValueError: array contains NaN or infinity

What gives? How can I make the SVM play nicely with the missing data, given that the missing data works fine for random forests and the other classifiers?

asked Jul 11 '12 by Jim

People also ask

Does SVM work with missing values?

Although SVMs are an attractive option when constructing a classifier, SVMs do not easily accommodate missing covariate information. As with other prediction and classification methods, inattention to missing data when constructing an SVM can impact the accuracy and utility of the resulting classifier.

How do you ignore missing values in Python?

The possible ways to do this are: Filling the missing data with the mean or median value if it's a numerical variable. Filling the missing data with mode if it's a categorical value. Filling the numerical value with 0 or -999, or some other number that will not occur in the data.
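The numerical strategies above can be sketched with plain NumPy; this is a minimal illustration on a toy matrix, not part of the original answer:

```python
import numpy as np

# Toy feature matrix; np.nan is the missing-value placeholder.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Column-wise means computed over the non-missing values only
# (nanmean ignores NaNs).
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its column.
X_filled = np.where(np.isnan(X), col_means, X)

print(X_filled)  # the NaN in column 0 becomes 4.0 (mean of 1.0 and 7.0)
```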


3 Answers

You can do data imputation to handle missing values before using SVM.

EDIT: In scikit-learn, there's a really easy way to do this, illustrated on this page.

(copied from page and modified)

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> # missing_values is the value of your placeholder; strategy is one of
>>> # 'mean', 'median', or 'most_frequent'; axis=0 computes the statistic
>>> # per column (feature) across all samples
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit(train)
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> train_imp = imp.transform(train)
answered Sep 29 '22 by Wei


You can either remove the samples with missing features or replace the missing features with their column-wise medians or means.
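Both options in this answer are one-liners in NumPy; here is a minimal sketch on a toy matrix (the data is made up for illustration):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [4.0, np.nan]])

# Option 1: drop every sample (row) that has any missing feature.
complete_rows = ~np.isnan(X).any(axis=1)
X_dropped = X[complete_rows]  # keeps only the fully observed rows

# Option 2: replace missing entries with the column-wise median
# (nanmedian ignores NaNs when computing the statistic).
col_medians = np.nanmedian(X, axis=0)
X_filled = np.where(np.isnan(X), col_medians, X)
```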

answered Sep 29 '22 by ogrisel


The most popular answer here is outdated. "Imputer" is now "SimpleImputer". The current way to solve this issue is given here. Imputing the training and testing data worked for me as follows:

from sklearn import svm
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(x_train)

X_train_imp = imp.transform(x_train)
X_test_imp = imp.transform(x_test)
    
clf = svm.SVC()
clf = clf.fit(X_train_imp, y_train)
predictions = clf.predict(X_test_imp)
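To avoid fitting the imputer and the classifier in separate steps, the two can also be chained with a scikit-learn Pipeline; the sketch below uses synthetic data (random features with some entries set to NaN), which is my own addition, not part of the original answer:

```python
import numpy as np
from sklearn import svm
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic data: 100 samples, 4 features, label from the first feature.
rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = (X[:, 0] > 0).astype(int)
# Punch random holes in the feature matrix.
X[rng.rand(100, 4) < 0.1] = np.nan

# The pipeline imputes missing values, then fits the SVM, in one object.
pipe = make_pipeline(SimpleImputer(strategy='mean'), svm.SVC())
pipe.fit(X, y)
predictions = pipe.predict(X)
```

Using a pipeline also guarantees that the test data is imputed with statistics learned from the training data, which is exactly what the separate fit/transform calls above do by hand.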
answered Sep 29 '22 by Hagbard