I have read many questions similar to this one but still cannot figure this out.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
X_to_predict = np.array([[ 1.37097033e+002, 0.00000000e+000, -1.82710826e+296,
1.22703799e+002, 1.37097033e+002, -2.56391552e+001,
1.11457878e+002, 1.37097033e+002, -2.56391552e+001,
9.81898928e+001, 1.22703799e+002, -2.45139066e+001,
9.24341823e+001, 1.11457878e+002, -1.90236954e+001]])
clf.predict_proba(X_to_predict)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
My issue is caused by neither nan nor inf values, since:
np.isnan(X_to_predict).sum()
Out[147]: 0
np.isinf(X_to_predict).sum()
Out[148]: 0
Question: How can I convert X_to_predict to values that are not too large for float32 while keeping as many digits after the decimal point as possible?
If you inspect the dtype of your array X_to_predict, it should show float64.
# slightly modified array from the question
X_to_predict = np.array([1.37097033e+002, 0.00000000e+000, -1.82710826e+296,
1.22703799e+002, 1.37097033e+002, -2.56391552e+001,
1.11457878e+002, 1.37097033e+002, -2.56391552e+001,
9.81898928e+001, 1.22703799e+002, -2.45139066e+001]).reshape((3, 4))
print(X_to_predict.dtype)
>>> float64
sklearn's RandomForestClassifier (like the DecisionTreeClassifier in the question) silently converts the array to float32; see the discussion here for the origin of the error message.
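To see which entries trigger the error before calling the classifier, you can compare the float64 values against the float32 range. The following is a minimal sketch; exceeds_float32 is a hypothetical helper name, not something sklearn or NumPy provides.

```python
import numpy as np

def exceeds_float32(arr):
    """Hypothetical helper: flag finite float64 entries whose magnitude is
    larger than the largest finite float32 value."""
    arr = np.asarray(arr, dtype=np.float64)
    limit = float(np.finfo(np.float32).max)  # roughly 3.4028235e+38
    return np.isfinite(arr) & (np.abs(arr) > limit)

# First four values of the array from the question
X_to_predict = np.array([1.37097033e+002, 0.00000000e+000,
                         -1.82710826e+296, 1.22703799e+002])
mask = exceeds_float32(X_to_predict)
print(mask)               # boolean mask of the offending entries
print(np.argwhere(mask))  # their positions
```

np.isnan and np.isinf both report nothing here because the value is a perfectly valid float64; it only turns into -inf once it is cast down to float32.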
You can convert it yourself to see what happens:
print(X_to_predict.astype(np.float32))
>>> array([[137.09703 , 0. , -inf, 122.7038 ],
[137.09703 , -25.639154, 111.45788 , 137.09703 ],
[-25.639154, 98.189896, 122.7038 , -24.513906]],
dtype=float32)
The third value (-1.82710826e+296) becomes -inf in float32, because its magnitude is far beyond the largest finite float32 value (about 3.4e+38). The only way around this is to replace the resulting inf values with the largest finite float32 value (or, for -inf, the most negative one). You will lose some precision. As far as I know, there is currently no parameter or workaround, other than changing the implementation in sklearn and recompiling it.
If you use np.nan_to_num, your array should look like this:
new_X = np.nan_to_num(X_to_predict.astype(np.float32))
print(new_X)
>>> array([[ 1.3709703e+02, 0.0000000e+00, -3.4028235e+38, 1.2270380e+02],
[ 1.3709703e+02, -2.5639154e+01, 1.1145788e+02, 1.3709703e+02],
[-2.5639154e+01, 9.8189896e+01, 1.2270380e+02, -2.4513906e+01]],
dtype=float32)
which should be accepted by your classifier.
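An alternative, sketched here under the assumption that clamping is acceptable for your data, is to clip the float64 values to the float32 range before casting, so that no inf is ever produced. to_float32_clipped is a hypothetical helper name. Unlike np.nan_to_num, this leaves any NaN values untouched.

```python
import numpy as np

def to_float32_clipped(arr):
    """Hypothetical helper: clamp values to the finite float32 range,
    then cast, so the cast itself cannot overflow to +/-inf."""
    info = np.finfo(np.float32)
    arr = np.asarray(arr, dtype=np.float64)
    clipped = np.clip(arr, float(info.min), float(info.max))
    return clipped.astype(np.float32)

X_to_predict = np.array([1.37097033e+002, 0.00000000e+000, -1.82710826e+296,
                         1.22703799e+002, 1.37097033e+002, -2.56391552e+001,
                         1.11457878e+002, 1.37097033e+002, -2.56391552e+001,
                         9.81898928e+001, 1.22703799e+002, -2.45139066e+001]).reshape((3, 4))
new_X = to_float32_clipped(X_to_predict)
print(new_X.dtype)
print(new_X)
```

Clipping first also avoids the overflow RuntimeWarning that recent NumPy versions may emit on the direct cast. For a tree-based model the clamped value should end up on the same side of every split threshold as the original, since the thresholds were learned from float32 training data.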
Complete code
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
iris = load_iris()
clf = RandomForestClassifier(n_estimators=10,
random_state=42)
clf.fit(iris.data, iris.target)
X_to_predict = np.array([1.37097033e+002, 0.00000000e+000, -1.82710826e+296,
1.22703799e+002, 1.37097033e+002, -2.56391552e+001,
1.11457878e+002, 1.37097033e+002, -2.56391552e+001,
9.81898928e+001, 1.22703799e+002, -2.45139066e+001]).reshape((3, 4))
print(X_to_predict.dtype)
print(X_to_predict.astype(np.float32))
new_X = np.nan_to_num(X_to_predict.astype(np.float32))
print(new_X)
# should return array([2, 2, 0])
print(clf.predict(new_X))
# should raise the ValueError from the question
clf.predict(X_to_predict)