I am using scikit-learn for some data analysis, and my dataset has some missing values (represented by NA). I load the data in with genfromtxt with dtype='f8' and go about training my classifier.

The classification is fine with RandomForestClassifier and GradientBoostingClassifier objects, but using SVC from sklearn.svm causes the following error:
probas = classifiers[i].fit(train[traincv], target[traincv]).predict_proba(train[testcv])
File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 409, in predict_proba
X = self._validate_for_predict(X)
File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 534, in _validate_for_predict
X = atleast2d_or_csr(X, dtype=np.float64, order="C")
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 84, in atleast2d_or_csr
assert_all_finite(X)
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 20, in assert_all_finite
raise ValueError("array contains NaN or infinity")
ValueError: array contains NaN or infinity
What gives? How can I make the SVM play nicely with the missing data, keeping in mind that the missing data works fine for random forests and other classifiers?
Although SVMs are an attractive option when constructing a classifier, they do not easily accommodate missing covariate information. As with other prediction and classification methods, inattention to missing data when constructing an SVM can impact the accuracy and utility of the resulting classifier.
The possible ways to do this are:

- Filling the missing data with the mean or median value if it's a numerical variable.
- Filling the missing data with the mode if it's a categorical variable.
- Filling the numerical value with 0 or -999, or some other sentinel number that will not occur in the data.
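As a minimal sketch of these strategies, assuming pandas and made-up column names:

import pandas as pd

df = pd.DataFrame({
    'age': [25.0, None, 31.0, 47.0],        # numerical column with a missing value
    'color': ['red', 'blue', None, 'red'],  # categorical column with a missing value
})

# Numerical: fill with the column mean (or .median())
df['age'] = df['age'].fillna(df['age'].mean())

# Categorical: fill with the mode (most frequent value)
df['color'] = df['color'].fillna(df['color'].mode()[0])

# Or use a sentinel value that cannot occur in the real data
# df['age'] = df['age'].fillna(-999)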
You can do data imputation to handle missing values before using SVM.
EDIT: In scikit-learn, there's a really easy way to do this, illustrated on this page.
(copied from page and modified)
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> # missing_values is the value of your placeholder; strategy can be
>>> # 'mean', 'median' or 'most_frequent' (the mode); axis=0 computes each
>>> # feature's statistic column-wise, over the values of the other samples
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit(train)
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> train_imp = imp.transform(train)
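To tie this back to the question: the imputed matrix can then be fed straight to the SVC. A minimal sketch, reusing train_imp and imp from above and assuming target and test arrays like the question's (note that predict_proba requires probability=True):

from sklearn.svm import SVC

clf = SVC(probability=True)       # probability=True enables predict_proba
clf.fit(train_imp, target)        # no NaNs left, so no ValueError
test_imp = imp.transform(test)    # impute test data with the same fitted imputer
probas = clf.predict_proba(test_imp)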
You can either remove the samples with missing features or replace the missing features with their column-wise medians or means.
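A minimal sketch of both options in plain NumPy, with illustrative array names:

import numpy as np

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y = np.array([0, 1, 0])

# Option 1: drop every sample (row) that has a NaN in any feature
mask = ~np.isnan(X).any(axis=1)
X_clean, y_clean = X[mask], y[mask]

# Option 2: replace NaNs with the column-wise means
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)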
The most popular answer here is outdated: Imputer is now SimpleImputer. The current way to solve this issue is given here. Imputing the training and test data worked for me as follows:
from sklearn import svm
import numpy as np
from sklearn.impute import SimpleImputer

# Fit the imputer on the training data only, then reuse it on the test
# data, so the test set is filled with the training means (no leakage)
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(x_train)
X_train_imp = imp.transform(x_train)
X_test_imp = imp.transform(x_test)

# Train and predict on the imputed (NaN-free) arrays
clf = svm.SVC()
clf = clf.fit(X_train_imp, y_train)
predictions = clf.predict(X_test_imp)