 

classifiers in scikit-learn that handle nan/null

I was wondering if there are classifiers that handle nan/null values in scikit-learn. I thought random forest regressor handled this, but I got an error when I called predict.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
    y_train = np.array([1, 2])
    clf = RandomForestRegressor()
    clf.fit(X_train, y_train)
    X_test = np.array([[7, 8, np.nan]])
    y_pred = clf.predict(X_test)  # Fails!

Can I not call predict with any scikit-learn algorithm with missing values?

Edit: Now that I think about it, this makes sense. It's not an issue during training, but when you predict, how do you branch when the variable is null? Maybe you could just split both ways and average the result? It seems like k-NN should work fine as long as the distance function ignores nulls, though (see the sketch below).
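
A rough sketch of that k-NN idea, on made-up data. This uses scikit-learn's nan_euclidean_distances, which skips coordinates that are NaN in either point and rescales the remaining ones:

    import numpy as np
    from sklearn.metrics.pairwise import nan_euclidean_distances

    X_train = np.array([[1.0, np.nan, 3.0], [np.nan, 5.0, 6.0]])
    y_train = np.array([1, 2])
    X_test = np.array([[7.0, 8.0, np.nan]])

    # Distances are computed only over coordinates present in both points
    d = nan_euclidean_distances(X_test, X_train)
    print(y_train[np.argmin(d, axis=1)])  # label of the nearest neighbor -> [2]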

Edit 2 (older and wiser me): Some GBM libraries (such as xgboost) use a ternary tree instead of a binary tree precisely for this purpose: two children for the yes/no decision and one child for the missing decision. sklearn uses a binary tree.
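
A toy illustration of that ternary split (hypothetical code, not xgboost's actual implementation):

    # Hypothetical ternary node: left/right for the yes/no decision,
    # plus a dedicated branch for missing values.
    def predict_node(node, x):
        if not isinstance(node, dict):   # leaf: a plain prediction value
            return node
        v = x[node['feature']]
        if v != v:                       # NaN is the only value not equal to itself
            return predict_node(node['missing'], x)
        branch = 'left' if v < node['threshold'] else 'right'
        return predict_node(node[branch], x)

    tree = {'feature': 0, 'threshold': 5, 'left': 1.0, 'right': 2.0, 'missing': 1.5}
    print(predict_node(tree, [3.0]))           # 1.0 (goes left)
    print(predict_node(tree, [float('nan')]))  # 1.5 (missing branch)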

asked May 19 '15 by anthonybell



2 Answers

I made an example that contains missing values in both the training and the test sets.

I just picked a strategy that replaces missing data with the mean, using the SimpleImputer class; a few of the other strategies are listed after the code.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer

    X_train = [[0, 0, np.nan], [np.nan, 1, 1]]
    Y_train = [0, 1]
    X_test_1 = [0, 0, np.nan]
    X_test_2 = [0, np.nan, np.nan]
    X_test_3 = [np.nan, 1, 1]

    # Create our imputer to replace missing values with the mean
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    imp = imp.fit(X_train)

    # Impute our data, then train
    X_train_imp = imp.transform(X_train)
    clf = RandomForestClassifier(n_estimators=10)
    clf = clf.fit(X_train_imp, Y_train)

    for X_test in [X_test_1, X_test_2, X_test_3]:
        # Impute each test item (transform expects a 2D array), then predict
        X_test_imp = imp.transform([X_test])
        print(X_test, '->', clf.predict(X_test_imp))

    # Output:
    # [0, 0, nan] -> [0]
    # [0, nan, nan] -> [0]
    # [nan, 1, 1] -> [1]
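
For reference, the other built-in strategies follow the same pattern; only the strategy argument changes (fill_value is used only with 'constant'):

    SimpleImputer(strategy='median')                  # per-column median
    SimpleImputer(strategy='most_frequent')           # per-column mode
    SimpleImputer(strategy='constant', fill_value=0)  # a fixed fill value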
answered Oct 13 '22 by bakkal

Short answer

Sometimes missing values are simply not applicable. Imputing them is meaningless. In these cases you should use a model that can handle missing values. Scikit-learn's models cannot handle missing values. XGBoost can.


More on scikit-learn and XGBoost

As mentioned in this article, scikit-learn's decision trees and KNN algorithms are not (yet) robust enough to work with missing values. If imputation doesn't make sense, don't do it.

Consider situations where imputation doesn't make sense (keep in mind this is a made-up example).

Consider a dataset with rows of cars ("Danho Diesel", "Estal Electric", "Hesproc Hybrid") and columns with their properties (Weight, Top speed, Acceleration, Power output, Sulfur Dioxide Emission, Range).

Electric cars do not produce exhaust fumes - so the Sulfur Dioxide Emission of the Estal Electric should be a NaN value (missing). You could argue that it should be set to 0 - but electric cars cannot produce sulfur dioxide. Imputing the value will ruin your predictions.
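
A minimal sketch of letting XGBoost handle the NaN natively (assuming the xgboost package is installed; the data is made up). NaNs go straight into fit and predict, and each split learns a default direction for missing values:

    import numpy as np
    from xgboost import XGBRegressor

    X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6], [7, 8, 9]])
    y_train = np.array([1.0, 2.0, 3.0])

    # No imputation: xgboost treats np.nan as "missing" by default
    model = XGBRegressor(n_estimators=10)
    model.fit(X_train, y_train)
    print(model.predict(np.array([[7, 8, np.nan]])))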


answered Oct 13 '22 by DannyDannyDanny