I am using scikit-learn for some data analysis, and my dataset has some missing values (represented by NA). I load the data in with genfromtxt with dtype='f8' and go about training my classifier.

The classification is fine with RandomForestClassifier and GradientBoostingClassifier objects, but using SVC from sklearn.svm causes the following error:
probas = classifiers[i].fit(train[traincv], target[traincv]).predict_proba(train[testcv])
File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 409, in predict_proba
X = self._validate_for_predict(X)
File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 534, in _validate_for_predict
X = atleast2d_or_csr(X, dtype=np.float64, order="C")
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 84, in atleast2d_or_csr
assert_all_finite(X)
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 20, in assert_all_finite
raise ValueError("array contains NaN or infinity")
ValueError: array contains NaN or infinity
What gives? How can I make the SVM play nicely with the missing data, keeping in mind that the missing data works fine for random forests and other classifiers?
Although SVMs are an attractive option when constructing a classifier, they do not easily accommodate missing covariate information. As with other prediction and classification methods, inattention to missing data when constructing an SVM can impact the accuracy and utility of the resulting classifier.
The possible ways to do this are:

- Filling the missing data with the mean or median value if it's a numerical variable.
- Filling the missing data with the mode if it's a categorical variable.
- Filling the numerical value with 0 or -999, or some other sentinel number that will not occur in the data.
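As a minimal sketch of these strategies, assuming pandas and made-up column names:

import pandas as pd

df = pd.DataFrame({
    'age': [25.0, None, 31.0, 47.0],        # numerical column with a missing value
    'color': ['red', 'blue', None, 'red'],  # categorical column with a missing value
})

# Numerical: fill with the column mean (or .median())
df['age'] = df['age'].fillna(df['age'].mean())

# Categorical: fill with the mode (most frequent value)
df['color'] = df['color'].fillna(df['color'].mode()[0])

# Or use a sentinel value that cannot occur in the real data
# df['age'] = df['age'].fillna(-999)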
You can do data imputation to handle missing values before using SVM.
EDIT: In scikit-learn, there's a really easy way to do this, illustrated on this page.
(copied from page and modified)
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> # missing_values is the value of your placeholder; strategy can be
>>> # 'mean', 'median' or 'most_frequent' (the mode); axis=0 computes each
>>> # feature's statistic column-wise, over the values of the other samples
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit(train)
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> train_imp = imp.transform(train)
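To tie this back to the question: the imputed matrix can then be fed straight to the SVC. A minimal sketch, reusing train_imp and imp from above and assuming target and test arrays like the question's (note that predict_proba requires probability=True):

from sklearn.svm import SVC

clf = SVC(probability=True)       # probability=True enables predict_proba
clf.fit(train_imp, target)        # no NaNs left, so no ValueError
test_imp = imp.transform(test)    # impute test data with the same fitted imputer
probas = clf.predict_proba(test_imp)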
You can either remove the samples with missing features or replace the missing features with their column-wise medians or means.
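A minimal sketch of both options in plain NumPy, with illustrative array names:

import numpy as np

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y = np.array([0, 1, 0])

# Option 1: drop every sample (row) that has a NaN in any feature
mask = ~np.isnan(X).any(axis=1)
X_clean, y_clean = X[mask], y[mask]

# Option 2: replace NaNs with the column-wise means
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)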
The most popular answer here is outdated: Imputer is now SimpleImputer. The current way to solve this issue is given here. Imputing the training and test data worked for me as follows:
from sklearn import svm
import numpy as np
from sklearn.impute import SimpleImputer

# Fit the imputer on the training data only, then reuse it on the test
# data, so the test set is filled with the training means (no leakage)
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(x_train)
X_train_imp = imp.transform(x_train)
X_test_imp = imp.transform(x_test)

# Train and predict on the imputed (NaN-free) arrays
clf = svm.SVC()
clf = clf.fit(X_train_imp, y_train)
predictions = clf.predict(X_test_imp)