 

classifiers in scikit-learn that handle nan/null

I was wondering if there are classifiers that handle nan/null values in scikit-learn. I thought random forest regressor handled this, but I got an error when I called predict.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
    y_train = np.array([1, 2])
    clf = RandomForestRegressor()
    clf.fit(X_train, y_train)
    X_test = np.array([[7, 8, np.nan]])
    y_pred = clf.predict(X_test)  # Fails!

Can I not call predict with any scikit-learn algorithm with missing values?

Edit: Now that I think about it, this makes sense. It's not an issue during training, but when you predict, how do you branch when the variable is null? Maybe you could just split both ways and average the result? It seems like k-NN should work fine as long as the distance function ignores nulls, though (see the sketch below).
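
A rough sketch of that k-NN idea, on made-up data. This uses scikit-learn's nan_euclidean_distances, which skips coordinates that are NaN in either point and rescales the remaining ones:

    import numpy as np
    from sklearn.metrics.pairwise import nan_euclidean_distances

    X_train = np.array([[1.0, np.nan, 3.0], [np.nan, 5.0, 6.0]])
    y_train = np.array([1, 2])
    X_test = np.array([[7.0, 8.0, np.nan]])

    # Distances are computed only over coordinates present in both points
    d = nan_euclidean_distances(X_test, X_train)
    print(y_train[np.argmin(d, axis=1)])  # label of the nearest neighbor -> [2]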

Edit 2 (older and wiser me): Some GBM libraries (such as xgboost) use a ternary tree instead of a binary tree precisely for this purpose: two children for the yes/no decision and one child for the missing decision. sklearn uses a binary tree.
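
A toy illustration of that ternary split (hypothetical code, not xgboost's actual implementation):

    # Hypothetical ternary node: left/right for the yes/no decision,
    # plus a dedicated branch for missing values.
    def predict_node(node, x):
        if not isinstance(node, dict):   # leaf: a plain prediction value
            return node
        v = x[node['feature']]
        if v != v:                       # NaN is the only value not equal to itself
            return predict_node(node['missing'], x)
        branch = 'left' if v < node['threshold'] else 'right'
        return predict_node(node[branch], x)

    tree = {'feature': 0, 'threshold': 5, 'left': 1.0, 'right': 2.0, 'missing': 1.5}
    print(predict_node(tree, [3.0]))           # 1.0 (goes left)
    print(predict_node(tree, [float('nan')]))  # 1.5 (missing branch)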

asked May 19 '15 by anthonybell



2 Answers

I made an example that contains missing values in both the training and the test sets.

I just picked a strategy that replaces missing data with the mean, using the SimpleImputer class; a few of the other strategies are listed after the code.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer

    X_train = [[0, 0, np.nan], [np.nan, 1, 1]]
    Y_train = [0, 1]
    X_test_1 = [0, 0, np.nan]
    X_test_2 = [0, np.nan, np.nan]
    X_test_3 = [np.nan, 1, 1]

    # Create our imputer to replace missing values with the mean
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    imp = imp.fit(X_train)

    # Impute our data, then train
    X_train_imp = imp.transform(X_train)
    clf = RandomForestClassifier(n_estimators=10)
    clf = clf.fit(X_train_imp, Y_train)

    for X_test in [X_test_1, X_test_2, X_test_3]:
        # Impute each test item (transform expects a 2D array), then predict
        X_test_imp = imp.transform([X_test])
        print(X_test, '->', clf.predict(X_test_imp))

    # Output:
    # [0, 0, nan] -> [0]
    # [0, nan, nan] -> [0]
    # [nan, 1, 1] -> [1]
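
For reference, the other built-in strategies follow the same pattern; only the strategy argument changes (fill_value is used only with 'constant'):

    SimpleImputer(strategy='median')                  # per-column median
    SimpleImputer(strategy='most_frequent')           # per-column mode
    SimpleImputer(strategy='constant', fill_value=0)  # a fixed fill value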
answered Oct 13 '22 by bakkal

Short answer

Sometimes missing values are simply not applicable. Imputing them is meaningless. In these cases you should use a model that can handle missing values. Scikit-learn's models cannot handle missing values. XGBoost can.


More on scikit-learn and XGBoost

As mentioned in this article, scikit-learn's decision trees and KNN algorithms are not (yet) robust enough to work with missing values. If imputation doesn't make sense, don't do it.

Consider situations where imputation doesn't make sense (keep in mind this is a made-up example).

Consider a dataset with rows of cars ("Danho Diesel", "Estal Electric", "Hesproc Hybrid") and columns with their properties (Weight, Top speed, Acceleration, Power output, Sulfur Dioxide Emission, Range).

Electric cars do not produce exhaust fumes - so the Sulfur Dioxide Emission of the Estal Electric should be a NaN value (missing). You could argue that it should be set to 0 - but electric cars cannot produce sulfur dioxide. Imputing the value will ruin your predictions.
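
A minimal sketch of letting XGBoost handle the NaN natively (assuming the xgboost package is installed; the data is made up). NaNs go straight into fit and predict, and each split learns a default direction for missing values:

    import numpy as np
    from xgboost import XGBRegressor

    X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6], [7, 8, 9]])
    y_train = np.array([1.0, 2.0, 3.0])

    # No imputation: xgboost treats np.nan as "missing" by default
    model = XGBRegressor(n_estimators=10)
    model.fit(X_train, y_train)
    print(model.predict(np.array([[7, 8, np.nan]])))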


answered Oct 13 '22 by DannyDannyDanny